
Broker hangs and crashes when listing non-persistent topics #5417

Closed
digi691 opened this issue Oct 18, 2019 · 19 comments
Labels
release/2.7.1  type/bug

Comments

digi691 commented Oct 18, 2019

Describe the bug
On a Pulsar cluster with version 2.3.0 or 2.4.1, when I send the API request /admin/v2/non-persistent/{tenant}/{namespace} to one of my brokers, the request just hangs. If I send this request too frequently, the broker's API becomes unresponsive until the broker is restarted. The broker also never logs the GET request.

To Reproduce
Steps to reproduce the behavior:

  1. Run step 2 on a Pulsar broker that is part of a cluster
  2. curl -v http://{HOST}:8080/admin/v2/non-persistent/{tenant}/{namespace}

Expected behavior
I expect Pulsar to return an empty list.
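
For comparison, on a standalone instance (where this does not reproduce) the same request returns immediately with an empty JSON array; host, tenant, and namespace are placeholders:

$ curl http://{HOST}:8080/admin/v2/non-persistent/{tenant}/{namespace}
[]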

Additional context
I could not reproduce this on a standalone instance, but the behavior is present on two of our clusters. We do not use non-persistent topics, but when trying to use Presto against Pulsar, the Presto jar arbitrarily tries to list non-persistent topics.

vicaya (Contributor) commented Oct 31, 2019

We're seeing this in 2.4.0 clusters as well. I suspect that bumping the Jetty thread count would help. Let's make the Jetty thread count configurable.

digi691 (Author) commented Nov 1, 2019

numHttpServerThreads is currently set to 8, which is what #3776 suggests.

jiazhai (Member) commented Dec 3, 2019

@digi691 Could increasing numHttpServerThreads solve this issue?

tuteng (Member) commented Dec 3, 2019

As I understand it, this problem can be mitigated by increasing numHttpServerThreads. If the HTTP request concurrency is high, the value of this parameter should be raised. We may also need to add documentation explaining this HTTP request behavior under high concurrency.
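
For reference, a minimal sketch of the relevant broker.conf entry; the value of 32 below is only an illustrative assumption and should be sized to your expected concurrent admin/HTTP request load:

# broker.conf
# Number of threads used by the broker's Jetty HTTP server (the suggested default is 8).
# 32 is an example value, not a recommendation.
numHttpServerThreads=32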

digi691 (Author) commented Dec 3, 2019

What is the appropriate number? As I said earlier, I currently have it set to 8, which is the suggested default. What I don't understand is that any other request to the API finishes fine; it's only this endpoint that causes the whole broker to crash. I don't have proof of what is going on, but it seems like hitting that endpoint ties up an HTTP server thread indefinitely.

tuteng (Member) commented Dec 4, 2019

Thank you for your reply. I will test this on a cluster and try to fix the problem.

tuteng (Member) commented Dec 6, 2019

I tested this problem in a cluster (three brokers and three bookies) using the ab tool, with numHttpServerThreads at its default of 8. When the number of queries issued is significantly higher than what those threads can handle, the broker does indeed block.

I think this is because non-persistent topics are stored in memory. When there are multiple brokers, your request is sent to one of them, and that broker completes the remaining work: after receiving the request, it forwards it to all brokers in the cluster, looks up all bundles on every broker, traverses all topics under those bundles, and finally returns the topic list. A large part of this operation is network latency, so the problem above appears when there are many concurrent requests. I have not found a suitable way to solve this with a code fix, but I have two alternatives for querying non-persistent topics.

  1. Reasonably evaluate your query load and configure an appropriate numHttpServerThreads value, but this method cannot completely solve the problem above.

  2. Alternatively, split the query into the following three steps:

a. Get all broker addresses

curl -v http://any-broker-ip:8080/admin/v2/brokers/cluster-name

b. Loop the following REST API call over the brokers to obtain the bundles:

curl -v http://broker-1:8080/admin/v2/non-persistent/test/test-namespace
curl -v http://broker-2:8080/admin/v2/non-persistent/test/test-namespace

This returns results like:

"bundles" : {
    "boundaries" : [ "0x00000000", "0x40000000", "0x80000000", "0xc0000000", "0xffffffff" ],
    "numBundles" : 4
  },

c. Get the topics under each bundle:

...
http://broker-ip:8080/admin/v2/non-persistent/test/test-namespace/0x00000000_0x40000000
http://broker-ip:8080/admin/v2/non-persistent/test/test-namespace/0x40000000_0x80000000
...

I think the second method avoids blocking the broker when there are many concurrent requests. A small shell sketch of step (c) follows.
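
A minimal shell sketch of step (c), under the assumption that the bundle boundaries are the ones shown above; the broker address, tenant, and namespace are placeholders, and the boundary list should be replaced with whatever step (b) returns for your namespace:

#!/usr/bin/env bash
# List non-persistent topics bundle by bundle instead of for the whole namespace.
# BROKER and NAMESPACE are placeholders; BOUNDARIES should be the "boundaries"
# array returned in step (b).
BROKER="http://broker-ip:8080"
NAMESPACE="test/test-namespace"
BOUNDARIES=(0x00000000 0x40000000 0x80000000 0xc0000000 0xffffffff)

# Each adjacent pair of boundaries forms one bundle range, e.g. 0x00000000_0x40000000.
for ((i = 0; i < ${#BOUNDARIES[@]} - 1; i++)); do
  range="${BOUNDARIES[i]}_${BOUNDARIES[i+1]}"
  echo "Topics in bundle ${range}:"
  curl -s "${BROKER}/admin/v2/non-persistent/${NAMESPACE}/${range}"
  echo
done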

tuteng (Member) commented Dec 7, 2019

I think this issue can be closed; if there are any further problems, we can consider reopening it.

digi691 (Author) commented Dec 9, 2019

@tuteng I will talk to my team about upgrading our test environment. Hopefully I will be able to get to testing 2.4.2 within the next couple of weeks.

digi691 (Author) commented Jan 7, 2020

@tuteng I was able to upgrade our dev/test Pulsar cluster to version 2.4.2. Now hitting the /admin/v2/non-persistent/{tenant}/{namespace} API endpoint causes an HTTP 504 error when connecting through the proxy, whether the namespace has non-persistent topics or not. When connecting directly to the broker at the same endpoint, it seems to wait indefinitely. The crashing of the broker API seems to be resolved, but I still cannot hit this endpoint on a Pulsar cluster. When Pulsar is running in standalone mode, it produces an empty list.

digi691 (Author) commented Jan 14, 2020

@tuteng Let me know how I should proceed, or how I can offer further help in figuring out this issue.

tuteng (Member) commented Jan 14, 2020

@digi691 Can you give me a detailed description of your setup, such as whether authentication is turned on, the commands used, the relevant cluster configuration, etc.?

digi691 (Author) commented Jan 17, 2020

@tuteng In my development environment I have two HAProxy servers that load balance between two Pulsar proxy servers. Behind those I have 3 brokers, which point to 4 bookkeepers, plus 3 ZooKeeper nodes. The difference between my development and production environments is that in development the configuration store is also the ZooKeeper quorum, while in production I have a separate ZooKeeper quorum (3 nodes) and configuration store (3 nodes), as well as 8 bookkeepers. Just to note, I am seeing this issue in our production instance as well.

Here is the broker.conf I'm using on all three brokers; it is essentially the same in production apart from different host names, cert names, and bucket names: https://gist.github.com/digi691/2a27c8a6055145e98450fc7efce8c0c4. FYI, I had to scrub the file of host names, identifying details in cert names, etc. As you'll notice, TLS is turned on, though authentication is currently turned off.

When hitting the admin API to list non-persistent topics through HAProxy and the Pulsar proxies, I time out, as I would expect. When I point directly at the brokers and hit that admin API, it just sits and waits forever, as I explained above, and never produces a response. I also cannot use the pulsar-admin topics list subcommand, as I believe it tries to list non-persistent topics as well as persistent ones.

digi691 (Author) commented Apr 29, 2020

I know this issue is stale, but we are still seeing it on our bare-metal clusters. I recently set up Pulsar in Kubernetes and I'm not seeing this behavior there. Is there any kind of misconfiguration of the brokers and ZooKeeper that could cause the broker to just swallow these types of requests, never respond, and not log any errors about it?

flowchartsman (Contributor) commented Jan 10, 2021

I'm also seeing this issue on a cluster built around the pulsar-all image with Docker. If I inspect the requests with a proxy, I can see that pulsarctl and pulsar-admin both make two requests in the background:

admin/v2/persistent/<tenant>/<namespace>
admin/v2/non-persistent/<tenant>/<namespace>

The request to the persistent route immediately returns a 200 and a list of topics. The request to the non-persistent route times out, and the client displays only an error.

codelipenghui (Contributor) commented:

@flowchartsman I have tried to reproduce this problem, but it does not seem easy to reproduce. Does it work when you list all topics through the broker directly? If it works, the problem is probably related to the Pulsar proxy; otherwise, the problem is likely on the broker side. This will help us locate the problem.

flowchartsman (Contributor) commented:

I can confirm that this happens when accessing the broker directly and when accessing the proxy:

curl -v http://<broker_addr>:8080/admin/v2/non-persistent/tenantName/namespaceName
*   Trying <broker_addr>...
* TCP_NODELAY set
* Connected to <broker_addr> (<broker_addr>) port 8080 (#0)
> GET /admin/v2/non-persistent/tenantName/namespaceName HTTP/1.1
> Host: <broker_addr>:8080
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Date: Mon, 11 Jan 2021 19:15:13 GMT
< broker-address: <broker_addr>
< Content-Type: text/plain
< Content-Length: 3183
< Server: Jetty(9.4.33.v20201020)
<

 --- An unexpected error occurred in the server ---

Message: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException

Stacktrace:

org.apache.pulsar.client.admin.PulsarAdminException: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
        at org.apache.pulsar.client.admin.internal.BaseResource.getApiException(BaseResource.java:231)
        at org.apache.pulsar.client.admin.internal.TopicsImpl$5.failed(TopicsImpl.java:233)
        at org.glassfish.jersey.client.JerseyInvocation$1.failed(JerseyInvocation.java:839)
        at org.glassfish.jersey.client.ClientRuntime.processFailure(ClientRuntime.java:247)
        at org.glassfish.jersey.client.ClientRuntime.processFailure(ClientRuntime.java:242)
        at org.glassfish.jersey.client.ClientRuntime.access$100(ClientRuntime.java:62)
        at org.glassfish.jersey.client.ClientRuntime$2.lambda$failure$1(ClientRuntime.java:178)
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
        at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:288)
        at org.glassfish.jersey.client.ClientRuntime$2.failure(ClientRuntime.java:178)
        at org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector.lambda$apply$1(AsyncHttpConnector.java:200)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
        at org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector.lambda$timeoutAfter$7(AsyncHttpConnector.java:300)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at java.util.concurrent.CompletableFuture.orApply(CompletableFuture.java:1385)
        at java.util.concurrent.CompletableFuture$OrApply.tryFire(CompletableFuture.java:1364)
        at java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1034)
        ... 10 more
Caused by: java.util.concurrent.TimeoutException
        ... 8 more
* Connection #0 to host <broker_addr> left intact
* Closing connection 0

sijie (Member) commented Jan 21, 2021

This is fixed by #9228

sijie closed this as completed Jan 21, 2021