
[Bug] /metrics endpoint processes requests one-by-one and seems to queue up waiting requests infinitely, ignoring the request timeout #22477

Closed · Fixed by #22494
lhotari opened this issue Apr 10, 2024 · 13 comments
Labels: type/bug

lhotari (Member) commented Apr 10, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

Applies to at least 3.0.x and 3.2.x versions of the Broker.

Minimal reproduce step

TBD. I believe this could be done with an HTTP performance testing tool such as k6 or wrk2. Will follow up later. This might also require a large number of topics so that the metrics output size is significant.

This problem also reproduces with the metricsBufferResponse=true setting, which is expected to cache the results and allow concurrent requests.

What did you expect to see?

Calling the /metrics endpoint should be possible concurrently; the metrics result should be generated once and shared across the concurrent requests, and the expiration of the cached result should be configurable.
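
For illustration, here is a minimal sketch of that expected behaviour: a single shared cache entry with a configurable TTL, so that concurrent callers reuse the same (possibly still in-flight) generation. The class and method names are hypothetical, not the existing broker code.

import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch, not Pulsar's actual implementation.
public class SharedMetricsCache {
    private record Entry(long createdNanos, CompletableFuture<String> metrics) {}

    private final AtomicReference<Entry> cached = new AtomicReference<>();
    private final Supplier<String> generator; // the expensive metrics generation
    private final long ttlNanos;              // configurable expiration of the cached result

    public SharedMetricsCache(Supplier<String> generator, Duration ttl) {
        this.generator = generator;
        this.ttlNanos = ttl.toNanos();
    }

    public CompletableFuture<String> getMetrics() {
        long now = System.nanoTime();
        while (true) {
            Entry current = cached.get();
            if (current != null && now - current.createdNanos() < ttlNanos) {
                // share the cached or still in-flight result with all concurrent callers
                return current.metrics();
            }
            Entry fresh = new Entry(now, new CompletableFuture<>());
            if (cached.compareAndSet(current, fresh)) {
                // only the winning thread runs the expensive generation
                CompletableFuture.runAsync(() -> {
                    try {
                        fresh.metrics().complete(generator.get());
                    } catch (Throwable t) {
                        fresh.metrics().completeExceptionally(t);
                        cached.compareAndSet(fresh, null); // allow a retry on the next call
                    }
                });
                return fresh.metrics();
            }
            // lost the race: loop and reuse the entry installed by the winner
        }
    }
}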

What did you see instead?

Requests queue up until there's a failure about reaching the connection limit:
INFO org.eclipse.jetty.server.ConnectionLimit - Connection Limit(2048) reached for [ServerConnector@.........{HTTP/1.1, (http/1.1)}{0.0.0.0:8080}]

Anything else?

The problem is partially mitigated by enabling gzip compression with #21667; however, that doesn't address the root cause.
There has been a previous attempt to address performance problems in the /metrics endpoint with #14453.

Are you willing to submit a PR?

  • I'm willing to submit a PR!
lhotari added the type/bug label Apr 10, 2024
lhotari self-assigned this Apr 10, 2024
lhotari (Member, Author) commented Apr 11, 2024

While trying to reproduce this issue, I ran into another issue where metrics stop working completely.
A GET call to /metrics/ returns an empty response.

I have Pulsar 3.0.4 deployed using Helm chart 3.4.0 RC1:

helm repo add --force-update apache-pulsar-dist-dev https://dist.apache.org/repos/dist/dev/pulsar/helm-chart/3.4.0-candidate-1/
helm repo update
helm install -n pulsar pulsar apache-pulsar-dist-dev/pulsar --version 3.4.0 --set affinity.anti_affinity=false --set broker.replicaCount=1 --wait --debug

After this, start port forwarding to the broker pod:

kubectl port-forward -n pulsar $(kubectl get pods -n pulsar -l component=broker -o jsonpath='{.items[0].metadata.name}') 8080:8080

I created a large number of topics with very long tenant, namespace, and topic names:

git clone https://github.com/lhotari/pulsar-playground
cd pulsar-playground
./gradlew shadowJar
# create the topics
java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioCreateLongNamedTopics
# ensure that all topics are loaded
java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioLoadAll

Sample topic name for a single topic is persistent://aVecxoNHhQeRENVQLqFKGgVhpVHrspZsbNrcZIYgiFgSGAXMgaqrygIxlpWeVJHUVAhJmlABdRJVUYDNrEPnrpIiFkPxiyjjxWwQCsUJTwhxLqIeOpmiZvudXVKwnVemuPxVaukUsOsrsSQeNdzrulFTkYbfKpPgIqEwYqNgpqHFeQFRPbMAXwhcCxpvSGRplDCIMMnxbLnSEOUbvPFnTKhjxPOqZUBqnjdXRIIsqLGsqTlEKoojLexBteBDmXgVhMxaUSjVrGUgWbkfqpgPEhknVrUnmpCoycGhApByzvXmHWZJcJhbnwFamawjHlvDzaeFlQLvyjqxXFSQqGQxjeGflzeSGypsLijZfNnKxOpTJZQkgWPKBQqFDGalBjZBgyhGiCoBNAxXgRpxUNBtKnLaeOZGSqlzWicaqQgBDteQiksNCUarRYSAKBXyoiYsjUaxiaosDXHsYcLLEUpjdFNOHpFPrWdejBHPFDLltdQhlzjwYBrRAkfFCuXgQKtlcPPfPausmeUvrwnQBrNFApLjPTUqufLnJrdCksSCcOVetltZaghNmLTchETGpPhuCGgAFzuZjIFqglZPVCBlRqjLaGYHtxnsANfVzUTavdiKgicKWHRdlofsMgDdIlGhowLunxiFgJjEQTtKtftUDVJrykrkqsLaqjZeEuCTMXTDDkKFvKAApPKxqoOkhiWYhWaHRwGWWRnoEpGPmbRGdRDqqtyfQSXrzlQceTYFeWFSUXIgvjavdjnqFZLjhqKKZGEIEgRQQWwxTKLEsUrWqdvYqVXnIZXfzXmKBoOZtyGvYRTbhZoLHwUImdaqjpNXCJjJgpOQefSMnzLrCxZZoLcjTaIlzWJovIEPVsEzqoQxznxPIcWGYDIeGotegFqGneLDeVvqPpwvhQkdJcEpoNDkNJSkKwQpOWRuxjIFQDBtkqmBDmSxuGxUpgkkOUstlHHGcpjNsuZAjfAHHNkRWHjOzFTQKryFNhQaKGzr005/YGKBcGdELHIcIRdpHllGnDxxFdbMvYENMsnVyaGrzCEQlClLzlbBQTyNtrysvLHboFnQtWrTWdJxMuiycgPInCuGzzITtjdLjCGRuYnPWugOGmnGDSpgKMaKXSlmxDSFZHdAZFLZljytpOVFYmJaCSjQYVhdVTkPPvHglVpyXRiOMpJfZHBqnWWEfHvxKguKTaHkwnWBbcvkwBowhwNhoDhQqvhmDSmLrBIVaCkufPpgBZGFmUjGuHQdPmfXiaLDvexPAuUuoSIChLDBsPNZEuddahamZOmRXBDCBNBTVFdimjHwoQmCbWFTbBjjhqBSHzTZqTbJJAzymaYlfMIgIhkSzrZVCQcDxebYTXIrEkIyQcXuMKAZSMxAJsJvzRDrcxoAObihSaUxRxFHvWVXOObjHlaGOGttfNOIDUcszFVKvQSgcsHBEuxzwRmDpMuGoWDnRMKzjUfYToJDLfKJxgDnKwcTnRmIOWINAqxCZiaeCcvdTToUoSXxPVRDaSYekHgGYLfPYUyGgFLHgpcRHJHZROPnsdxSFNAcYIvrbQswSEbWjADlsAYGMOimuIvcMshBrNNYtcdehCzzlzGuxPXsGcoDHzamPezRvfduoKSFSWCJJFuYxsBnGfEZOSfWFxtqWpSdcNnyPyrwVPNWyQaDWKCYQeIZbxXrPatPCPnlixzxwaxJJJphyKtplRnSpbDfDrASIAFExAyITwSMFRdfhELCcRVARtaRSXjEihdZaZmFUerIkdQDuMZaarWbWeThwEcDHPUatUXWHfTBJaCJgjJiCvwgSrrhqRKuCleJHxGxujqeLTNDvGfvDnxznUNsEPzkudWnovfoudDuavzBOpuRzoSkNJAJMzXfdShYNtukpuxfjKLwQGfxWCUBClcAOjsTMsQkCzNaFSQlukCmoginaKnAdGMFjPzUuFncfYmQWzzhBGhCWIRkCQYPEQiAFeSwlLYssjNcVHcacrxjGudnPDXGCybKhPtk008/MqmGcLFhesqqROTDnTBqolaXFeLDMahTOmpuxcYoLLTRekLSnIHsOdizlxidvuJuoGGzCBIamVoLbsiWJDvrujPXYObyCYsVEMAcswIkUrqTooVIcVCRuHYmtdrHdxPAiLgOWQKEHIHpPeFethszwWYZGSYwhyjcRhtUWfUNtztPEmiTOUzWkaPSAYTnqhfUhUZTCFjRYMsUDYcxHwPFJBwrnDtGZanKJUmowHQHRAFXWSVMTSCSeCABjEbBafSAtWnzyrahNlDWEyeamWWZTPrRczQQOEKyQFHmFStKKeILqliysnUQcWtuofVQZtKdMQJQIqdueNchYpIaSRiWlkyekOZVMOyDGeWVTaAFKxYzHHOPTqNtllnnIqZjPFHkBmFXRkgFUVSwONvuNdIIzITlmzlRBtCucTLdTdPdmDhKVGZEPUSaVmujfFQgSHjmsXmNRqVeoAFBVJvDYuCoctatemUDgkttMZHRCccfWLZAluYOvLyqtxxgMSyjsBWpEYktbgFjZDMxJkrCsbTUlslfzWRhlGkhsDtwNwColLMVZKIdSxdowUcMwFnlmuaHrIBgiITbfUichnESKgnWkfMGwbJTMTMNFoqdElriwJoZfPgqYIHKVlbKRnqXwaACdOGSyjVHYltgacxzqOKfAZQzpWtOfpHbYsuEFVBobuIQXcFYdCxLRVkvrDpZCsjegdppuoRKqdZWGBUUARUAKKXcDiWMvWWCxcSDeLxcKvAIGGeZMolpVPUAqcDBNfPphPkBskEFzqzgDYDOViLsIwylMohMYmWXQjCREBSczgmoIlbkfXEAhmwuWcVRNuKmbUVjHIuQIhdzunSFcBriEdHeujDQyqcEOYOVohupMVloQWlbrFDRUKuTYugyKverkGPKSEeXwbtcbVjqWaijnTFbtHHmQVkHJrkRcwfWfWtlDoKDgCNZGOhokAuTsaZVleulSWpsUICIlUPNGyVLuCBPUKTowVHzISLvctEQ010
This is to ensure that the topics consume a lot of bytes in the output.

Metrics stopped working:

❯ curl -v http://localhost:8080/metrics/
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /metrics/ HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.4.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 11 Apr 2024 10:03:10 GMT
< Content-Type: text/plain;charset=utf-8
< Content-Length: 0
< Server: Jetty(9.4.54.v20240208)
<
* Connection #0 to host localhost left intact

lhotari (Member, Author) commented Apr 11, 2024

In the above case, I noticed that the default Prometheus scraping interval in the Apache Pulsar Helm chart was 10 seconds. It's possible that the scraping contributed to the broker getting into a bad state where the metrics response is empty. This confirms that the current solution has multiple issues.

lhotari (Member, Author) commented Apr 11, 2024

With the topics created by java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioCreateLongNamedTopics, this /metrics/ empty-response issue reproduces consistently.
After a broker restart, the problem starts happening once loading of all topics has been triggered with java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioLoadAll.

lhotari (Member, Author) commented Apr 11, 2024

Ran out of heap in the test, so I'm retrying with more heap. I also got java.io.IOException: Packet len 15589885 is out of range!, so I'm increasing jute.maxbuffer:

affinity:
  anti_affinity: false
bookkeeper:
  resources:
    requests:
      memory: 800m
      cpu: 1
  configData:
    PULSAR_EXTRA_OPTS: >
      -Djute.maxbuffer=20000000
    PULSAR_MEM: >
      -Xms128m -Xmx400m -XX:MaxDirectMemorySize=300m
broker:
  replicaCount: 1
  resources:
    requests:
      memory: 2500m
      cpu: 1
  configData:
    metricsBufferResponse: "true"
    zookeeperSessionExpiredPolicy: "shutdown"
    PULSAR_EXTRA_OPTS: >
      -Djute.maxbuffer=20000000
    PULSAR_MEM: >
      -Xms128m -Xmx1024m -XX:MaxDirectMemorySize=1024m
proxy:
  replicaCount: 1
zookeeper:
  resources:
    requests:
      memory: 600m
      cpu: 1
  configData:
    PULSAR_EXTRA_OPTS: >
      -Djute.maxbuffer=20000000
    PULSAR_MEM: >
      -Xms200m -Xmx500m
kube-prometheus-stack:
  prometheus-node-exporter:
    hostRootFsMount: 
      enabled: false

dao-jun (Member) commented Apr 11, 2024

I think it might be caused by the long duration of metrics scraping? I remember some metrics need to read from BookKeeper.

dao-jun (Member) commented Apr 11, 2024

#14453 was introduced by me. It is intended as an improvement: in some cases the /metrics endpoint might be requested more than once, so I wanted to cache the metrics and return them directly if there are other requests within that period.

lhotari (Member, Author) commented Apr 11, 2024

#14453 was introduced by me. It is intended as an improvement: in some cases the /metrics endpoint might be requested more than once, so I wanted to cache the metrics and return them directly if there are other requests within that period.

@dao-jun yes, that helps to some extent. I'll be refactoring and improving it further to address the problems that remain (for example, the infinite queuing of requests).
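
As a rough illustration of one possible direction (the names and limits below are illustrative only, not the actual change that ends up in #22494): bound how long a request may wait for the single metrics generation and reject the rest with 503 instead of queuing them without limit.

import java.io.IOException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative sketch only; not the actual Pulsar servlet.
public class BoundedMetricsServlet extends HttpServlet {
    // allow one generation at a time; waiters time out instead of piling up
    private final Semaphore permit = new Semaphore(1);
    private static final long MAX_WAIT_MS = 5_000;

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        boolean acquired = false;
        try {
            acquired = permit.tryAcquire(MAX_WAIT_MS, TimeUnit.MILLISECONDS);
            if (!acquired) {
                // fail fast instead of queuing behind an unbounded backlog
                resp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
                        "metrics generation is busy, retry later");
                return;
            }
            resp.setContentType("text/plain;charset=utf-8");
            resp.getWriter().write(generateMetrics()); // the expensive step
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            resp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE, "interrupted");
        } finally {
            if (acquired) {
                permit.release();
            }
        }
    }

    private String generateMetrics() {
        return ""; // placeholder for the actual Prometheus text generation
    }
}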

lhotari (Member, Author) commented Apr 11, 2024

Storing this in load_test.js:

import http from 'k6/http';
import { sleep } from 'k6';

export let options = {
    vus: 100, // Number of virtual users
    iterations: 10000, // Total number of request iterations
};

export default function () {
    http.get('http://localhost:8080/metrics/');
}

Running it with k6:

brew install k6
k6 run load_test.js

It reproduces the queuing issue, at least when metricsBufferResponse isn't enabled. Multiple problems show up when testing Pulsar 3.0.4 in a local k8s environment.

lhotari (Member, Author) commented Apr 11, 2024

I was able to stretch it further by running in standalone mode with more memory:

PULSAR_MEM="-Xms2g -Xmx4g -XX:MaxDirectMemorySize=6g" PULSAR_GC="-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+AlwaysPreTouch" PULSAR_EXTRA_OPTS="-Djute.maxbuffer=20000000" PULSAR_STANDALONE_USE_ZOOKEEPER=1 bin/pulsar standalone -nss -nfw 2>&1 | tee standalone.log

When running the k6 test, response times increase until they hit 30 seconds, which is the request timeout in k6.

The response time increases roughly linearly (successive response times in milliseconds):
582 1121 1577 2033 2499 2970 3440 3917 4398 4859 5330 5806 6269 6763 7199 7647 8097 8561 9015 9482 9916 10372 10820 11252 11708 12160 12669 13117 13554 13991 14504 14966 15411 15901 16364 16832 17270 17745 18215 18690 19157 19589 20055 20538 20974 21409 21884 22323 22787 23247 23750 24188 24629 25065 25531 26008 26593 27133 27723 28303 28879 29439

This can be seen in https://gist.github.com/lhotari/137be0ab5d16101f0a88e6628e1d34b3.

Even after k6 is closed, request processing continues. For some of the orphaned requests, the response size is reported as the full size of the output, which means the metrics have been re-generated for the queued and orphaned requests. For example, 384122607 bytes (384 MB) is the response size of the full output. One possible mitigation is sketched after the log excerpt below.

2024-04-11T19:29:13,672+0300 [prometheus-stats-51-1] ERROR org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsServlet - Failed to send metrics, likely the client or this server closed the connection due to a timeout (373 ms elapsed): org.eclipse.jetty.io.EofException
2024-04-11T19:29:13,672+0300 [prometheus-stats-51-1] INFO  org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [11/Apr/2024:19:28:48 +0300] "GET /metrics/ HTTP/1.1" 200 384122607 "-" "k6/0.50.0 (https://k6.io/)" 25073
2024-04-11T19:29:14,077+0300 [prometheus-stats-51-1] ERROR org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsServlet - Failed to send metrics, likely the client or this server closed the connection due to a timeout (405 ms elapsed): org.eclipse.jetty.io.EofException
2024-04-11T19:29:14,077+0300 [prometheus-stats-51-1] INFO  org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [11/Apr/2024:19:28:49 +0300] "GET /metrics/ HTTP/1.1" 200 0 "-" "k6/0.50.0 (https://k6.io/)" 24889
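
The log excerpt above shows queued requests still being fully processed, and the complete ~384 MB payload regenerated, after the client has already disconnected. A minimal sketch of the mitigation idea, assuming the servlet records when each request was enqueued (the timeout constant and helper name are made up for illustration):

import java.io.IOException;
import java.util.concurrent.TimeUnit;
import javax.servlet.http.HttpServletResponse;

// Illustrative helper, not Pulsar's actual code.
public final class StaleRequestGuard {
    private static final long CLIENT_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(30);

    private StaleRequestGuard() {}

    // Returns true if the request waited so long that the client (e.g. k6 with its
    // 30 s timeout) has almost certainly given up; in that case a short error is
    // sent and the expensive metrics generation is skipped entirely.
    public static boolean skipIfStale(long enqueuedAtMillis,
                                      HttpServletResponse resp) throws IOException {
        long waited = System.currentTimeMillis() - enqueuedAtMillis;
        if (waited > CLIENT_TIMEOUT_MS) {
            resp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
                    "request waited " + waited + " ms, client likely timed out");
            return true;
        }
        return false;
    }
}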

lhotari (Member, Author) commented Apr 11, 2024

Tested standalone mode with the metricsBufferResponse=true optimization created in #14453.

PULSAR_PREFIX_metricsBufferResponse=true docker/pulsar/scripts/apply-config-from-env.py conf/standalone.conf
PULSAR_MEM="-Xms2g -Xmx4g -XX:MaxDirectMemorySize=6g" PULSAR_GC="-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+AlwaysPreTouch" PULSAR_EXTRA_OPTS="-Djute.maxbuffer=20000000" PULSAR_STANDALONE_USE_ZOOKEEPER=1 bin/pulsar standalone -nss -nfw 2>&1 | tee standalone.log

It seems that it's able to keep up even with the 100-virtual-user k6 test introduced above.

frankjkelly (Contributor) commented:
Very cool - I was not aware of that setting.
That said, the current description for that parameter isn't helpful in understanding just how useful it can be :-) https://pulsar.apache.org/reference/#/next/config/reference-configuration-broker?id=optional
"If true, export buffered metrics"

lhotari (Member, Author) commented Apr 11, 2024

The complete 10000-request test passed, downloading 3.6 TB of metrics data in under 10 minutes (this was with a local Pulsar standalone on a Mac M3 Max).

❯ k6 run load_test.js

          /\      |‾‾| /‾‾/   /‾‾/
     /\  /  \     |  |/  /   /  /
    /  \/    \    |     (   /   ‾‾\
   /          \   |  |\  \ |  (‾)  |
  / __________ \  |__| \__\ \_____/ .io

     execution: local
        script: load_test.js
        output: -

     scenarios: (100.00%) 1 scenario, 100 max VUs, 10m30s max duration (incl. graceful stop):
              * default: 10000 iterations shared among 100 VUs (maxDuration: 10m0s, gracefulStop: 30s)


     data_received..................: 3.6 TB 6.2 GB/s
     data_sent......................: 880 kB 1.5 kB/s
     http_req_blocked...............: avg=44.33µs min=1µs      med=3µs     max=5.5ms    p(90)=4.1µs   p(95)=6µs
     http_req_connecting............: avg=35.38µs min=0s       med=0s      max=4.19ms   p(90)=0s      p(95)=0s
     http_req_duration..............: avg=5.74s   min=196.38ms med=5.74s   max=6.34s    p(90)=5.99s   p(95)=6.14s
       { expected_response:true }...: avg=5.74s   min=196.38ms med=5.74s   max=6.34s    p(90)=5.99s   p(95)=6.14s
     http_req_failed................: 0.00%  ✓ 0         ✗ 10000
     http_req_receiving.............: avg=66.89ms min=47.39ms  med=67.33ms max=192.11ms p(90)=76.73ms p(95)=82.04ms
     http_req_sending...............: avg=24.05µs min=3µs      med=14µs    max=5.43ms   p(90)=26µs    p(95)=43µs
     http_req_tls_handshaking.......: avg=0s      min=0s       med=0s      max=0s       p(90)=0s      p(95)=0s
     http_req_waiting...............: avg=5.67s   min=4.25ms   med=5.67s   max=6.25s    p(90)=5.92s   p(95)=6.07s
     http_reqs......................: 10000  17.325729/s
     iteration_duration.............: avg=5.74s   min=197.65ms med=5.74s   max=6.34s    p(90)=5.99s   p(95)=6.14s
     iterations.....................: 10000  17.325729/s
     vus............................: 3      min=3       max=100
     vus_max........................: 100    min=100     max=100


running (09m37.2s), 000/100 VUs, 10000 complete and 0 interrupted iterations
default ✓ [======================================] 100 VUs  09m37.2s/10m0s  10000/10000 shared iters

dao-jun (Member) commented Apr 12, 2024

Very cool - I was not aware of that setting. That said the current description for that parameter isn't helpful in understanding just how useful it can be :-) https://pulsar.apache.org/reference/#/next/config/reference-configuration-broker?id=optional If true, export buffered metrics

I'll improve the description of the configuration field.
