
[Bug] /metrics endpoint processes requests one-by-one and seems to queue up waiting requests infinitely, ignoring the request timeout #22477

Closed · Fixed by #22494
lhotari opened this issue Apr 10, 2024 · 13 comments
Labels: type/bug

lhotari (Member) commented Apr 10, 2024

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

Applies to at least 3.0.x and 3.2.x versions of the Broker.

Minimal reproduce step

TBD. I believe this could be done with an HTTP performance testing tool such as k6 or wrk2. Will follow up later. This might also require a large number of topics so that the metrics output size is significant.

This problem also reproduces with the metricsBufferResponse=true setting, which is expected to cache the results and allow concurrent requests.

What did you expect to see?

Calling the /metrics endpoint should be possible concurrently; the metrics result should be generated once and shared across the concurrent requests, and the expiration of the cached result should be configurable.
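
For illustration, here is a minimal sketch of that expected behaviour: a single shared cache entry with a configurable TTL, so that concurrent callers reuse the same (possibly still in-flight) generation. The class and method names are hypothetical, not the existing broker code.

import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical sketch, not Pulsar's actual implementation.
public class SharedMetricsCache {
    private record Entry(long createdNanos, CompletableFuture<String> metrics) {}

    private final AtomicReference<Entry> cached = new AtomicReference<>();
    private final Supplier<String> generator; // the expensive metrics generation
    private final long ttlNanos;              // configurable expiration of the cached result

    public SharedMetricsCache(Supplier<String> generator, Duration ttl) {
        this.generator = generator;
        this.ttlNanos = ttl.toNanos();
    }

    public CompletableFuture<String> getMetrics() {
        long now = System.nanoTime();
        while (true) {
            Entry current = cached.get();
            if (current != null && now - current.createdNanos() < ttlNanos) {
                // share the cached or still in-flight result with all concurrent callers
                return current.metrics();
            }
            Entry fresh = new Entry(now, new CompletableFuture<>());
            if (cached.compareAndSet(current, fresh)) {
                // only the winning thread runs the expensive generation
                CompletableFuture.runAsync(() -> {
                    try {
                        fresh.metrics().complete(generator.get());
                    } catch (Throwable t) {
                        fresh.metrics().completeExceptionally(t);
                        cached.compareAndSet(fresh, null); // allow a retry on the next call
                    }
                });
                return fresh.metrics();
            }
            // lost the race: loop and reuse the entry installed by the winner
        }
    }
}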

What did you see instead?

Requests queue up until there's a failure about reaching the connection limit:
INFO org.eclipse.jetty.server.ConnectionLimit - Connection Limit(2048) reached for [ServerConnector@.........{HTTP/1.1, (http/1.1)}{0.0.0.0:8080}]

Anything else?

The problem is partially mitigated by enabling gzip compression with #21667; however, that doesn't address the root cause.
There has been a previous attempt to address performance problems in the /metrics endpoint with #14453.

Are you willing to submit a PR?

  • I'm willing to submit a PR!
lhotari added the type/bug label Apr 10, 2024
lhotari self-assigned this Apr 10, 2024
lhotari (Member, Author) commented Apr 11, 2024

While trying to reproduce this issue, I ran into another issue where metrics stop working completely.
A GET call to /metrics/ returns an empty response.

I have Pulsar 3.0.4 deployed using Helm chart 3.4.0 RC1:

helm repo add --force-update apache-pulsar-dist-dev https://dist.apache.org/repos/dist/dev/pulsar/helm-chart/3.4.0-candidate-1/
helm repo update
helm install -n pulsar pulsar apache-pulsar-dist-dev/pulsar --version 3.4.0 --set affinity.anti_affinity=false --set broker.replicaCount=1 --wait --debug

After this, start port forwarding to the broker pod:

kubectl port-forward -n pulsar $(kubectl get pods -n pulsar -l component=broker -o jsonpath='{.items[0].metadata.name}') 8080:8080

I created a large number of topics with very long tenant, namespace, and topic names:

git clone https://github.com/lhotari/pulsar-playground
cd pulsar-playground
./gradlew shadowJar
# create the topics
java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioCreateLongNamedTopics
# ensure that all topics are loaded
java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioLoadAll

Sample topic name for a single topic is persistent://aVecxoNHhQeRENVQLqFKGgVhpVHrspZsbNrcZIYgiFgSGAXMgaqrygIxlpWeVJHUVAhJmlABdRJVUYDNrEPnrpIiFkPxiyjjxWwQCsUJTwhxLqIeOpmiZvudXVKwnVemuPxVaukUsOsrsSQeNdzrulFTkYbfKpPgIqEwYqNgpqHFeQFRPbMAXwhcCxpvSGRplDCIMMnxbLnSEOUbvPFnTKhjxPOqZUBqnjdXRIIsqLGsqTlEKoojLexBteBDmXgVhMxaUSjVrGUgWbkfqpgPEhknVrUnmpCoycGhApByzvXmHWZJcJhbnwFamawjHlvDzaeFlQLvyjqxXFSQqGQxjeGflzeSGypsLijZfNnKxOpTJZQkgWPKBQqFDGalBjZBgyhGiCoBNAxXgRpxUNBtKnLaeOZGSqlzWicaqQgBDteQiksNCUarRYSAKBXyoiYsjUaxiaosDXHsYcLLEUpjdFNOHpFPrWdejBHPFDLltdQhlzjwYBrRAkfFCuXgQKtlcPPfPausmeUvrwnQBrNFApLjPTUqufLnJrdCksSCcOVetltZaghNmLTchETGpPhuCGgAFzuZjIFqglZPVCBlRqjLaGYHtxnsANfVzUTavdiKgicKWHRdlofsMgDdIlGhowLunxiFgJjEQTtKtftUDVJrykrkqsLaqjZeEuCTMXTDDkKFvKAApPKxqoOkhiWYhWaHRwGWWRnoEpGPmbRGdRDqqtyfQSXrzlQceTYFeWFSUXIgvjavdjnqFZLjhqKKZGEIEgRQQWwxTKLEsUrWqdvYqVXnIZXfzXmKBoOZtyGvYRTbhZoLHwUImdaqjpNXCJjJgpOQefSMnzLrCxZZoLcjTaIlzWJovIEPVsEzqoQxznxPIcWGYDIeGotegFqGneLDeVvqPpwvhQkdJcEpoNDkNJSkKwQpOWRuxjIFQDBtkqmBDmSxuGxUpgkkOUstlHHGcpjNsuZAjfAHHNkRWHjOzFTQKryFNhQaKGzr005/YGKBcGdELHIcIRdpHllGnDxxFdbMvYENMsnVyaGrzCEQlClLzlbBQTyNtrysvLHboFnQtWrTWdJxMuiycgPInCuGzzITtjdLjCGRuYnPWugOGmnGDSpgKMaKXSlmxDSFZHdAZFLZljytpOVFYmJaCSjQYVhdVTkPPvHglVpyXRiOMpJfZHBqnWWEfHvxKguKTaHkwnWBbcvkwBowhwNhoDhQqvhmDSmLrBIVaCkufPpgBZGFmUjGuHQdPmfXiaLDvexPAuUuoSIChLDBsPNZEuddahamZOmRXBDCBNBTVFdimjHwoQmCbWFTbBjjhqBSHzTZqTbJJAzymaYlfMIgIhkSzrZVCQcDxebYTXIrEkIyQcXuMKAZSMxAJsJvzRDrcxoAObihSaUxRxFHvWVXOObjHlaGOGttfNOIDUcszFVKvQSgcsHBEuxzwRmDpMuGoWDnRMKzjUfYToJDLfKJxgDnKwcTnRmIOWINAqxCZiaeCcvdTToUoSXxPVRDaSYekHgGYLfPYUyGgFLHgpcRHJHZROPnsdxSFNAcYIvrbQswSEbWjADlsAYGMOimuIvcMshBrNNYtcdehCzzlzGuxPXsGcoDHzamPezRvfduoKSFSWCJJFuYxsBnGfEZOSfWFxtqWpSdcNnyPyrwVPNWyQaDWKCYQeIZbxXrPatPCPnlixzxwaxJJJphyKtplRnSpbDfDrASIAFExAyITwSMFRdfhELCcRVARtaRSXjEihdZaZmFUerIkdQDuMZaarWbWeThwEcDHPUatUXWHfTBJaCJgjJiCvwgSrrhqRKuCleJHxGxujqeLTNDvGfvDnxznUNsEPzkudWnovfoudDuavzBOpuRzoSkNJAJMzXfdShYNtukpuxfjKLwQGfxWCUBClcAOjsTMsQkCzNaFSQlukCmoginaKnAdGMFjPzUuFncfYmQWzzhBGhCWIRkCQYPEQiAFeSwlLYssjNcVHcacrxjGudnPDXGCybKhPtk008/MqmGcLFhesqqROTDnTBqolaXFeLDMahTOmpuxcYoLLTRekLSnIHsOdizlxidvuJuoGGzCBIamVoLbsiWJDvrujPXYObyCYsVEMAcswIkUrqTooVIcVCRuHYmtdrHdxPAiLgOWQKEHIHpPeFethszwWYZGSYwhyjcRhtUWfUNtztPEmiTOUzWkaPSAYTnqhfUhUZTCFjRYMsUDYcxHwPFJBwrnDtGZanKJUmowHQHRAFXWSVMTSCSeCABjEbBafSAtWnzyrahNlDWEyeamWWZTPrRczQQOEKyQFHmFStKKeILqliysnUQcWtuofVQZtKdMQJQIqdueNchYpIaSRiWlkyekOZVMOyDGeWVTaAFKxYzHHOPTqNtllnnIqZjPFHkBmFXRkgFUVSwONvuNdIIzITlmzlRBtCucTLdTdPdmDhKVGZEPUSaVmujfFQgSHjmsXmNRqVeoAFBVJvDYuCoctatemUDgkttMZHRCccfWLZAluYOvLyqtxxgMSyjsBWpEYktbgFjZDMxJkrCsbTUlslfzWRhlGkhsDtwNwColLMVZKIdSxdowUcMwFnlmuaHrIBgiITbfUichnESKgnWkfMGwbJTMTMNFoqdElriwJoZfPgqYIHKVlbKRnqXwaACdOGSyjVHYltgacxzqOKfAZQzpWtOfpHbYsuEFVBobuIQXcFYdCxLRVkvrDpZCsjegdppuoRKqdZWGBUUARUAKKXcDiWMvWWCxcSDeLxcKvAIGGeZMolpVPUAqcDBNfPphPkBskEFzqzgDYDOViLsIwylMohMYmWXQjCREBSczgmoIlbkfXEAhmwuWcVRNuKmbUVjHIuQIhdzunSFcBriEdHeujDQyqcEOYOVohupMVloQWlbrFDRUKuTYugyKverkGPKSEeXwbtcbVjqWaijnTFbtHHmQVkHJrkRcwfWfWtlDoKDgCNZGOhokAuTsaZVleulSWpsUICIlUPNGyVLuCBPUKTowVHzISLvctEQ010
This is to ensure that the topics consume a lot of bytes in the output.

Metrics stopped working:

❯ curl -v http://localhost:8080/metrics/
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /metrics/ HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.4.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 11 Apr 2024 10:03:10 GMT
< Content-Type: text/plain;charset=utf-8
< Content-Length: 0
< Server: Jetty(9.4.54.v20240208)
<
* Connection #0 to host localhost left intact

lhotari (Member, Author) commented Apr 11, 2024

In the above case, I noticed that the default Prometheus scraping interval in the Apache Pulsar Helm chart was 10 seconds. It's possible that the scraping contributed to the broker getting into a bad state where the metrics response is empty. This confirms that the current solution has multiple issues.

lhotari (Member, Author) commented Apr 11, 2024

With the topics created by java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioCreateLongNamedTopics, this /metrics/ empty-response issue reproduces consistently.
After a broker restart, the problem starts happening once loading of all topics has been triggered with java -cp build/libs/pulsar-playground-all.jar com.github.lhotari.pulsar.playground.TestScenarioLoadAll.

lhotari (Member, Author) commented Apr 11, 2024

Ran out of heap in the test, so I'm retrying with more heap. I also got java.io.IOException: Packet len 15589885 is out of range!, so I'm increasing jute.maxbuffer:

affinity:
  anti_affinity: false
bookkeeper:
  resources:
    requests:
      memory: 800m
      cpu: 1
  configData:
    PULSAR_EXTRA_OPTS: >
      -Djute.maxbuffer=20000000
    PULSAR_MEM: >
      -Xms128m -Xmx400m -XX:MaxDirectMemorySize=300m
broker:
  replicaCount: 1
  resources:
    requests:
      memory: 2500m
      cpu: 1
  configData:
    metricsBufferResponse: "true"
    zookeeperSessionExpiredPolicy: "shutdown"
    PULSAR_EXTRA_OPTS: >
      -Djute.maxbuffer=20000000
    PULSAR_MEM: >
      -Xms128m -Xmx1024m -XX:MaxDirectMemorySize=1024m
proxy:
  replicaCount: 1
zookeeper:
  resources:
    requests:
      memory: 600m
      cpu: 1
  configData:
    PULSAR_EXTRA_OPTS: >
      -Djute.maxbuffer=20000000
    PULSAR_MEM: >
      -Xms200m -Xmx500m
kube-prometheus-stack:
  prometheus-node-exporter:
    hostRootFsMount: 
      enabled: false

dao-jun (Member) commented Apr 11, 2024

I think it might be caused by the long duration of metrics scraping? I remember some metrics need to read from BookKeeper.

dao-jun (Member) commented Apr 11, 2024

#14453 was introduced by me. It is intended as an improvement: in some cases the /metrics endpoint might be requested more than once, so I wanted to cache the metrics and return them directly if there are other requests within that period.

lhotari (Member, Author) commented Apr 11, 2024

#14453 was introduced by me. It is intended as an improvement: in some cases the /metrics endpoint might be requested more than once, so I wanted to cache the metrics and return them directly if there are other requests within that period.

@dao-jun yes, that helps to some extent. I'll be refactoring and improving it further to address the problems that remain (for example, the infinite queuing of requests).
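
As a rough illustration of one possible direction (the names and limits below are illustrative only, not the actual change that ends up in #22494): bound how long a request may wait for the single metrics generation and reject the rest with 503 instead of queuing them without limit.

import java.io.IOException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative sketch only; not the actual Pulsar servlet.
public class BoundedMetricsServlet extends HttpServlet {
    // allow one generation at a time; waiters time out instead of piling up
    private final Semaphore permit = new Semaphore(1);
    private static final long MAX_WAIT_MS = 5_000;

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        boolean acquired = false;
        try {
            acquired = permit.tryAcquire(MAX_WAIT_MS, TimeUnit.MILLISECONDS);
            if (!acquired) {
                // fail fast instead of queuing behind an unbounded backlog
                resp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
                        "metrics generation is busy, retry later");
                return;
            }
            resp.setContentType("text/plain;charset=utf-8");
            resp.getWriter().write(generateMetrics()); // the expensive step
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            resp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE, "interrupted");
        } finally {
            if (acquired) {
                permit.release();
            }
        }
    }

    private String generateMetrics() {
        return ""; // placeholder for the actual Prometheus text generation
    }
}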

lhotari (Member, Author) commented Apr 11, 2024

Storing this in load_test.js:

import http from 'k6/http';
import { sleep } from 'k6';

export let options = {
    vus: 100, // Number of virtual users
    iterations: 10000, // Total number of request iterations
};

export default function () {
    http.get('http://localhost:8080/metrics/');
}

Running it with k6:

brew install k6
k6 run load_test.js

It reproduces the queuing issue, at least when metricsBufferResponse isn't enabled. Multiple problems show up when testing Pulsar 3.0.4 in a local k8s environment.

lhotari (Member, Author) commented Apr 11, 2024

I was able to stretch it further by running in standalone mode with more memory:

PULSAR_MEM="-Xms2g -Xmx4g -XX:MaxDirectMemorySize=6g" PULSAR_GC="-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+AlwaysPreTouch" PULSAR_EXTRA_OPTS="-Djute.maxbuffer=20000000" PULSAR_STANDALONE_USE_ZOOKEEPER=1 bin/pulsar standalone -nss -nfw 2>&1 | tee standalone.log

When running the k6 test, response times increase until they hit 30 seconds, which is the request timeout in k6.

The response time increases roughly linearly (successive response times in milliseconds):
582 1121 1577 2033 2499 2970 3440 3917 4398 4859 5330 5806 6269 6763 7199 7647 8097 8561 9015 9482 9916 10372 10820 11252 11708 12160 12669 13117 13554 13991 14504 14966 15411 15901 16364 16832 17270 17745 18215 18690 19157 19589 20055 20538 20974 21409 21884 22323 22787 23247 23750 24188 24629 25065 25531 26008 26593 27133 27723 28303 28879 29439

This can be seen in https://gist.github.com/lhotari/137be0ab5d16101f0a88e6628e1d34b3.

Even after k6 is closed, request processing continues. For some of the orphaned requests, the response size is reported as the full size of the output, which means the metrics have been re-generated for the queued and orphaned requests. For example, 384122607 bytes (384 MB) is the response size of the full output. One possible mitigation is sketched after the log excerpt below.

2024-04-11T19:29:13,672+0300 [prometheus-stats-51-1] ERROR org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsServlet - Failed to send metrics, likely the client or this server closed the connection due to a timeout (373 ms elapsed): org.eclipse.jetty.io.EofException
2024-04-11T19:29:13,672+0300 [prometheus-stats-51-1] INFO  org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [11/Apr/2024:19:28:48 +0300] "GET /metrics/ HTTP/1.1" 200 384122607 "-" "k6/0.50.0 (https://k6.io/)" 25073
2024-04-11T19:29:14,077+0300 [prometheus-stats-51-1] ERROR org.apache.pulsar.broker.stats.prometheus.PrometheusMetricsServlet - Failed to send metrics, likely the client or this server closed the connection due to a timeout (405 ms elapsed): org.eclipse.jetty.io.EofException
2024-04-11T19:29:14,077+0300 [prometheus-stats-51-1] INFO  org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [11/Apr/2024:19:28:49 +0300] "GET /metrics/ HTTP/1.1" 200 0 "-" "k6/0.50.0 (https://k6.io/)" 24889
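
The log excerpt above shows queued requests still being fully processed, and the complete ~384 MB payload regenerated, after the client has already disconnected. A minimal sketch of the mitigation idea, assuming the servlet records when each request was enqueued (the timeout constant and helper name are made up for illustration):

import java.io.IOException;
import java.util.concurrent.TimeUnit;
import javax.servlet.http.HttpServletResponse;

// Illustrative helper, not Pulsar's actual code.
public final class StaleRequestGuard {
    private static final long CLIENT_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(30);

    private StaleRequestGuard() {}

    // Returns true if the request waited so long that the client (e.g. k6 with its
    // 30 s timeout) has almost certainly given up; in that case a short error is
    // sent and the expensive metrics generation is skipped entirely.
    public static boolean skipIfStale(long enqueuedAtMillis,
                                      HttpServletResponse resp) throws IOException {
        long waited = System.currentTimeMillis() - enqueuedAtMillis;
        if (waited > CLIENT_TIMEOUT_MS) {
            resp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
                    "request waited " + waited + " ms, client likely timed out");
            return true;
        }
        return false;
    }
}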

lhotari (Member, Author) commented Apr 11, 2024

Tested standalone mode with the metricsBufferResponse=true optimization created in #14453.

PULSAR_PREFIX_metricsBufferResponse=true docker/pulsar/scripts/apply-config-from-env.py conf/standalone.conf
PULSAR_MEM="-Xms2g -Xmx4g -XX:MaxDirectMemorySize=6g" PULSAR_GC="-XX:+UseG1GC -XX:+PerfDisableSharedMem -XX:+AlwaysPreTouch" PULSAR_EXTRA_OPTS="-Djute.maxbuffer=20000000" PULSAR_STANDALONE_USE_ZOOKEEPER=1 bin/pulsar standalone -nss -nfw 2>&1 | tee standalone.log

It seems that it's able to keep up even with the 100-virtual-user k6 test introduced above.

frankjkelly (Contributor) commented:
Very cool - I was not aware of that setting.
That said, the current description for that parameter isn't helpful in understanding just how useful it can be :-) https://pulsar.apache.org/reference/#/next/config/reference-configuration-broker?id=optional
"If true, export buffered metrics"

lhotari (Member, Author) commented Apr 11, 2024

The complete 10000-request test passed, downloading 3.6 TB of metrics data in under 10 minutes (this was with a local Pulsar standalone on a Mac M3 Max).

❯ k6 run load_test.js

          /\      |‾‾| /‾‾/   /‾‾/
     /\  /  \     |  |/  /   /  /
    /  \/    \    |     (   /   ‾‾\
   /          \   |  |\  \ |  (‾)  |
  / __________ \  |__| \__\ \_____/ .io

     execution: local
        script: load_test.js
        output: -

     scenarios: (100.00%) 1 scenario, 100 max VUs, 10m30s max duration (incl. graceful stop):
              * default: 10000 iterations shared among 100 VUs (maxDuration: 10m0s, gracefulStop: 30s)


     data_received..................: 3.6 TB 6.2 GB/s
     data_sent......................: 880 kB 1.5 kB/s
     http_req_blocked...............: avg=44.33µs min=1µs      med=3µs     max=5.5ms    p(90)=4.1µs   p(95)=6µs
     http_req_connecting............: avg=35.38µs min=0s       med=0s      max=4.19ms   p(90)=0s      p(95)=0s
     http_req_duration..............: avg=5.74s   min=196.38ms med=5.74s   max=6.34s    p(90)=5.99s   p(95)=6.14s
       { expected_response:true }...: avg=5.74s   min=196.38ms med=5.74s   max=6.34s    p(90)=5.99s   p(95)=6.14s
     http_req_failed................: 0.00%  ✓ 0         ✗ 10000
     http_req_receiving.............: avg=66.89ms min=47.39ms  med=67.33ms max=192.11ms p(90)=76.73ms p(95)=82.04ms
     http_req_sending...............: avg=24.05µs min=3µs      med=14µs    max=5.43ms   p(90)=26µs    p(95)=43µs
     http_req_tls_handshaking.......: avg=0s      min=0s       med=0s      max=0s       p(90)=0s      p(95)=0s
     http_req_waiting...............: avg=5.67s   min=4.25ms   med=5.67s   max=6.25s    p(90)=5.92s   p(95)=6.07s
     http_reqs......................: 10000  17.325729/s
     iteration_duration.............: avg=5.74s   min=197.65ms med=5.74s   max=6.34s    p(90)=5.99s   p(95)=6.14s
     iterations.....................: 10000  17.325729/s
     vus............................: 3      min=3       max=100
     vus_max........................: 100    min=100     max=100


running (09m37.2s), 000/100 VUs, 10000 complete and 0 interrupted iterations
default ✓ [======================================] 100 VUs  09m37.2s/10m0s  10000/10000 shared iters

dao-jun (Member) commented Apr 12, 2024

Very cool - I was not aware of that setting. That said the current description for that parameter isn't helpful in understanding just how useful it can be :-) https://pulsar.apache.org/reference/#/next/config/reference-configuration-broker?id=optional If true, export buffered metrics

I'll improve the description of the configuration field.
