[Bug] /metrics endpoint processes requests one-by-one and seems to queue up waiting requests infinitely, ignoring the request timeout #22477
Comments
While trying to reproduce this issue, I ran into another issue where metrics stop working completely. I have Pulsar 3.0.4 deployed using chart 3.4.0 rc1:
```shell
helm repo add --force-update apache-pulsar-dist-dev https://dist.apache.org/repos/dist/dev/pulsar/helm-chart/3.4.0-candidate-1/
helm repo update
helm install -n pulsar pulsar apache-pulsar-dist-dev/pulsar --version 3.4.0 --set affinity.anti_affinity=false --set broker.replicaCount=1 --wait --debug
```
After this, start port forwarding to the broker pod:
```shell
kubectl port-forward -n pulsar $(kubectl get pods -n pulsar -l component=broker -o jsonpath='{.items[0].metadata.name}') 8080:8080
```
I create a large number of topics with very long tenant, namespace and topic names (one possible way to script this is sketched after this comment). Sample topic name for a single topic is
Metrics stopped working:
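For reference, a minimal sketch of how such a large set of topics with long names could be scripted with pulsar-admin. The name patterns, counts, 200-character filler, and the cluster name `pulsar` are assumptions for illustration, not the exact names used in the test above:
```shell
# Hypothetical topic-creation loop (adjust names, counts and cluster name as needed).
# Assumes pulsar-admin can reach the broker admin endpoint, e.g. via the port-forward above.
ADMIN_URL=http://localhost:8080
LONG_SUFFIX=$(printf 'x%.0s' {1..200})   # 200-character filler to make names very long
for i in $(seq 1 10); do
  tenant="tenant-${i}-${LONG_SUFFIX}"
  bin/pulsar-admin --admin-url "$ADMIN_URL" tenants create "$tenant" --allowed-clusters pulsar
  for j in $(seq 1 10); do
    ns="${tenant}/namespace-${j}-${LONG_SUFFIX}"
    bin/pulsar-admin --admin-url "$ADMIN_URL" namespaces create "$ns"
    for k in $(seq 1 10); do
      bin/pulsar-admin --admin-url "$ADMIN_URL" topics create "persistent://${ns}/topic-${k}-${LONG_SUFFIX}"
    done
  done
done
```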
In the above case, I noticed that the default scraping interval for Prometheus in the Apache Pulsar Helm chart was 10 seconds. It's possible that the scraping was part of how the broker got into the bad state where the metrics response is empty. This confirms that the current solution has multiple issues.
With the topics created with
Ran out of heap in the test, trying with more heap. Also got
```yaml
affinity:
  anti_affinity: false
bookkeeper:
  resources:
    requests:
      memory: 800m
      cpu: 1
  configData:
    PULSAR_EXTRA_OPTS: >
      -Djute.maxbuffer=20000000
    PULSAR_MEM: >
      -Xms128m -Xmx400m -XX:MaxDirectMemorySize=300m
broker:
  replicaCount: 1
  resources:
    requests:
      memory: 2500m
      cpu: 1
  configData:
    metricsBufferResponse: "true"
    zookeeperSessionExpiredPolicy: "shutdown"
    PULSAR_EXTRA_OPTS: >
      -Djute.maxbuffer=20000000
    PULSAR_MEM: >
      -Xms128m -Xmx1024m -XX:MaxDirectMemorySize=1024m
proxy:
  replicaCount: 1
zookeeper:
  resources:
    requests:
      memory: 600m
      cpu: 1
  configData:
    PULSAR_EXTRA_OPTS: >
      -Djute.maxbuffer=20000000
    PULSAR_MEM: >
      -Xms200m -Xmx500m
kube-prometheus-stack:
  prometheus-node-exporter:
    hostRootFsMount:
      enabled: false
```
I think it might be caused by the long duration of scraping metrics? I remember that some metrics need to read from BookKeeper.
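One quick way to check that hypothesis (a hedged sketch; the URL assumes the port-forward set up earlier) is to time a single scrape and see how large and slow the response is:
```shell
# Time one metrics scrape and report how long it took and how much data came back.
curl -s -o /dev/null \
  -w 'time_total=%{time_total}s size_download=%{size_download} bytes\n' \
  http://localhost:8080/metrics/
```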
#14453 was introduced by me. It was meant as an improvement: in some cases the /metrics endpoint might be requested more than once, so I wanted to cache the metrics and return them directly if there are other requests within that period.
@dao-jun yes, that helps to some extent. I'll be refactoring and improving it further to address the problems that remain (for example, the infinite queuing of requests).
Storing this in load_test.js:
```javascript
import http from 'k6/http';
import { sleep } from 'k6';

export let options = {
  vus: 100, // Number of virtual users
  iterations: 10000, // Total number of request iterations
};

export default function () {
  http.get('http://localhost:8080/metrics/');
}
```
Running it with k6:
```shell
brew install k6
k6 run load_test.js
```
It reproduces the queuing issue, at least when
I was able to stretch it more by running in standalone with more memory:
```shell
PULSAR_MEM="-Xms2g -Xmx4g -XX:MaxDirectMemorySize=6g" PULSAR_GC="-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+AlwaysPreTouch" PULSAR_EXTRA_OPTS="-Djute.maxbuffer=20000000" PULSAR_STANDALONE_USE_ZOOKEEPER=1 bin/pulsar standalone -nss -nfw 2>&1 | tee standalone.log
```
When running the k6 test, it can be seen that response times increase until they hit 30 seconds, which is the request timeout in k6. The response time increases linearly: this can be seen in https://gist.github.com/lhotari/137be0ab5d16101f0a88e6628e1d34b3. It can also be seen that even after k6 is closed, the request processing continues. On some of the orphaned requests, the response size is reported to be the full size of the output. That means that the metrics have been re-generated for the queued and orphaned requests. For example
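As a rough way to watch this queuing behaviour from the outside (a sketch; assumes the broker's HTTP port is 8080 on localhost), the number of open connections to the metrics port can be polled while the k6 test runs and after it is stopped:
```shell
# Poll the number of established connections to port 8080 once per second;
# if requests queue up, this count keeps growing and stays high even after k6 is stopped.
while true; do
  netstat -an | grep -c '[.:]8080 .*ESTABLISHED'
  sleep 1
done
```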
Tested standalone with
It seems that it's able to keep up even with the 100 virtual user k6 test introduced above.
Very cool - I was not aware of that setting.
The complete 10000-request test passed, downloading 3.6TB of metrics data in under 10 minutes (this was with the local Pulsar standalone on a Mac M3 Max).
I'll improve the description of the configuration field.
Version
Applies to at least 3.0.x and 3.2.x versions of the Broker.
Minimal reproduce step
TBD. I believe this could be done with an HTTP performance testing tool such as k6 or wrk2. Will follow up later. This might also require a large number of topics so that the metrics size is significant.
This problem also reproduces with the metricsBufferResponse=true setting, which is expected to cache the results and allow concurrent requests.
What did you expect to see?
Calling the /metrics endpoint should be possible concurrently and should generate the metrics results once and share them across the concurrent requests. The expiration of the cached result should be configurable.
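A simple way to verify this expectation once fixed (a hedged sketch; assumes the broker's metrics endpoint is reachable on localhost:8080) is to fire a handful of concurrent scrapes and compare their per-request times; if the result is generated once and shared, the times should stay close to the single-request time instead of growing with the number of callers:
```shell
# Fire 10 concurrent /metrics requests and print the total time of each one.
seq 1 10 | xargs -P 10 -I{} \
  curl -s -o /dev/null -w 'request {}: %{time_total}s\n' http://localhost:8080/metrics/
```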
What did you see instead?
Requests queue up until there's a failure about reaching the connection limit:
```
INFO org.eclipse.jetty.server.ConnectionLimit - Connection Limit(2048) reached for [ServerConnector@.........{HTTP/1.1, (http/1.1)}{0.0.0.0:8080}]
```
Anything else?
The problem is partially mitigated by enabling Gzip compression with #21667; however, it doesn't address the root cause.
There has been a previous attempt to address performance problems in the /metrics endpoint with #14453.
Are you willing to submit a PR?