Collect Prometheus latency stats using DataSketches #1245
@merlimat you might need to update LICENSE-bin.all.txt to include the datasketches library.
Otherwise the change looks good to me.
@merlimat the change looks good. Great work! What about adding a minimal test case for the servlet?
+1, this is great work.
It seems CI failed for this; it may need a fix before merging:
[INFO] org.apache.bookkeeper.stats.prometheus.PrometheusServlet is Serializable; consider declaring a serialVersionUID [org.apache.bookkeeper.stats.prometheus.PrometheusServlet] At PrometheusServlet.java:[lines 41-185] SE_NO_SERIALVERSIONID
Yes.. actually there was a missing |
The implementation for collecting and estimating latency quantiles in the Prometheus Java client library is very slow, and it is impacting bookie performance.
I have added a micro-benchmark that tests our various stats providers. These tests are simulating 16 concurrent threads updating the stats.
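As a rough illustration of the kind of workload described above, here is a minimal sketch of 16 threads incrementing a shared JDK LongAdder. It is not the micro-benchmark from this PR: the class name CounterContentionSketch and the method runIncrements are hypothetical, and there is no warm-up or proper benchmark harness, so the printed throughput is only indicative.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical illustration only; not the benchmark code added in this PR.
public class CounterContentionSketch {

    // Starts `threads` workers that each increment a shared LongAdder
    // `opsPerThread` times, then returns the final sum.
    static long runIncrements(int threads, int opsPerThread) {
        LongAdder counter = new LongAdder();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);

        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    start.await(); // release all workers at the same time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < opsPerThread; i++) {
                    counter.increment();
                }
            });
        }

        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.sum();
    }

    public static void main(String[] args) {
        int threads = 16;
        int opsPerThread = 10_000_000;
        long begin = System.nanoTime();
        long total = runIncrements(threads, opsPerThread);
        double seconds = (System.nanoTime() - begin) / 1e9;
        System.out.printf("%d increments, ~%.1fM ops/sec%n", total, total / seconds / 1e6);
    }
}
```

LongAdder spreads updates across internal cells, which is why it tolerates many concurrent writers well; the sum is only read once all workers have finished.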
Counter increment
Here Prometheus is fast, though not as fast as a simple LongAdder, which can reach ~500M ops/sec.
Latency quantiles
Here is where Prometheus is super-slow: 250K ops/second max, mostly due to contention and GC pressure.
Modification
I have re-adapted a stats collector I had done in the Yahoo branch:
https://github.com/yahoo/bookkeeper/tree/yahoo-4.3/bookkeeper-stats-providers/datasketches-metrics-provider/src/main/java/org/apache/bokkeeper/stats/datasketches
This is based on the DataSketches library to have very fast and lightweight quantile estimates (along with a number of other operations), plus some tricks to avoid concurrency issues by using thread-local collectors and aggregating in the background when needed.
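The thread-local idea can be sketched roughly as follows. This is not the code from this PR or from the Yahoo branch: the class ThreadLocalLatencyRecorder and its methods are hypothetical, and to stay within the JDK it swaps the DataSketches quantile sketch for an exact sorted array of samples, which uses far more memory than a real sketch would. Only the structure is the point: writers touch only their own buffer, and a periodic rotate() call (which a background thread would drive) merges everything into a read-only snapshot.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical illustration: exact samples stand in for a DataSketches sketch.
public class ThreadLocalLatencyRecorder {

    // Per-thread buffer. Only the owning thread and the aggregator touch it,
    // so its monitor is almost always uncontended.
    private static final class LocalBuffer {
        private long[] samples = new long[1024];
        private int size = 0;

        synchronized void add(long value) {
            if (size == samples.length) {
                samples = Arrays.copyOf(samples, size * 2);
            }
            samples[size++] = value;
        }

        // Hands the collected samples to the aggregator and resets the buffer.
        synchronized long[] drain() {
            long[] out = Arrays.copyOf(samples, size);
            size = 0;
            return out;
        }
    }

    // Every buffer ever created, so the aggregator can find them all.
    // Note: buffers of threads that have exited are never removed here.
    private final List<LocalBuffer> allBuffers = new CopyOnWriteArrayList<>();

    private final ThreadLocal<LocalBuffer> local = ThreadLocal.withInitial(() -> {
        LocalBuffer buffer = new LocalBuffer();
        allBuffers.add(buffer);
        return buffer;
    });

    // Latest aggregated samples, sorted ascending; replaced as a whole by rotate().
    private volatile long[] snapshot = new long[0];

    // Hot path: writes only to the calling thread's own buffer.
    public void recordLatency(long nanos) {
        local.get().add(nanos);
    }

    // Meant to be called periodically from a background thread.
    public void rotate() {
        List<long[]> parts = new ArrayList<>();
        int total = 0;
        for (LocalBuffer buffer : allBuffers) {
            long[] part = buffer.drain();
            parts.add(part);
            total += part.length;
        }
        long[] merged = new long[total];
        int pos = 0;
        for (long[] part : parts) {
            System.arraycopy(part, 0, merged, pos, part.length);
            pos += part.length;
        }
        Arrays.sort(merged);
        snapshot = merged;
    }

    // Reads from the last snapshot only; returns 0 if no samples were aggregated.
    public long getQuantile(double q) {
        long[] sorted = snapshot;
        if (sorted.length == 0) {
            return 0;
        }
        int index = (int) Math.min(sorted.length - 1, (long) Math.floor(q * sorted.length));
        return sorted[index];
    }
}
```

A caller would typically schedule rotate() on something like a ScheduledExecutorService and serve getQuantile() from the metrics endpoint, so scrapes never contend with the threads recording latencies.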
After the change, the throughput is 150x that of the original Prometheus collector.
It is worth noting that the main bottleneck in the recordLatency test is now the System.nanoTime() call used to pass different samples to the stat logger. System.nanoTime() is not super fast:
By removing the System.nanoTime() call from the benchmark, the Prometheus+DataSketches collector results in: