Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collect Prometheus latency stats using DataSketches #1245

Closed
wants to merge 10 commits into from

Conversation

merlimat
Copy link
Contributor

The implementation for collecting and estimating the latency quantiles in the Prometheus Java client library is very slow and it is impacting the the bookie performance.

I have added a micro-benchmark that tests our various stats providers. These tests are simulating 16 concurrent threads updating the stats.

Counter increment

Benchmark                              (statsProvider)   Mode  Cnt    Score      Error   Units
StatsLoggerBenchmark.counterIncrement       Prometheus  thrpt    3  391.882 ±  786.987  ops/us
StatsLoggerBenchmark.counterIncrement         Codahale  thrpt    3  449.341 ± 1337.736  ops/us
StatsLoggerBenchmark.counterIncrement          Twitter  thrpt    3   43.354 ±    9.331  ops/us
StatsLoggerBenchmark.counterIncrement          Ostrich  thrpt    3   43.790 ±    1.332  ops/us

Here prometheus is fast, though not as fast as a simple LongAdder which can reach ~500M ops/sec.

Latency quantiles

Benchmark                              (statsProvider)   Mode  Cnt    Score      Error   Units
StatsLoggerBenchmark.recordLatency          Prometheus  thrpt    3    0.255 ±    0.667  ops/us
StatsLoggerBenchmark.recordLatency            Codahale  thrpt    3    4.963 ±    1.671  ops/us
StatsLoggerBenchmark.recordLatency             Twitter  thrpt    3    4.793 ±    0.766  ops/us
StatsLoggerBenchmark.recordLatency             Ostrich  thrpt    3    2.473 ±    6.394  ops/us

Here is where Prometheus is super-slow: 250K ops/second max, mostly due to contention and GC pressure.

Modification

I have re-adapted a stats collector I had done in the Yahoo branch:
https://github.com/yahoo/bookkeeper/tree/yahoo-4.3/bookkeeper-stats-providers/datasketches-metrics-provider/src/main/java/org/apache/bokkeeper/stats/datasketches

This is based on the DataSketches library to have very fast and lightweight quantile estimates (along with a number of other operations), plus some tricks to avoid concurrency issues by using thread local collectors and aggregating when needed in background.

After the change, the throughput is 150x the original prometheus collector.

Benchmark                              (statsProvider)   Mode  Cnt    Score     Error   Units
StatsLoggerBenchmark.counterIncrement       Prometheus  thrpt    3  531.906 ± 129.602  ops/us
StatsLoggerBenchmark.recordLatency          Prometheus  thrpt    3   27.538 ±   5.893  ops/us

It is worth noting that the main bottle-neck in the recordLatency test is now the System.nanoTime()
call used to pass different samples to the stat logger.

System.nanoTime() is not super fast:

Benchmark                               (statsProvider)   Mode  Cnt    Score     Error   Units
StatsLoggerBenchmark.currentTimeMillis              N/A  thrpt    3  161.502 ± 267.238  ops/us
StatsLoggerBenchmark.nanoTime                       N/A  thrpt    3   32.822 ±   2.256  ops/us

By removing the System.nanoTime() call from the benchmark, the Prometheus+DataSketches collector results in:

Benchmark                               (statsProvider)   Mode  Cnt    Score     Error   Units
StatsLoggerBenchmark.recordLatency           Prometheus  thrpt    3  108.542 ±  31.848  ops/us 

@merlimat merlimat added this to the 4.7.0 milestone Mar 10, 2018
@merlimat merlimat self-assigned this Mar 10, 2018
Copy link
Member

@sijie sijie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@merlimat you might need to update LICENSE-bin.all.txt to include the datasketches library.

Otherwise the change looks good to me.

@eolivelli
Copy link
Contributor

@merlimat the change looks good. Great work!

What about adding a minimal test case on the servlet?
We will start soon to chase code not covered but testcases, it will be better not to add new code without proper testing

Copy link
Member

@jiazhai jiazhai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, this is great work.
Seems CI failed for this, may need fix it before merge:

[INFO] org.apache.bookkeeper.stats.prometheus.PrometheusServlet is Serializable; consider declaring a serialVersionUID [org.apache.bookkeeper.stats.prometheus.PrometheusServlet] At PrometheusServlet.java:[lines 41-185] SE_NO_SERIALVERSIONID

@merlimat
Copy link
Contributor Author

What about adding a minimal test case on the servlet?

Yes.. actually there was a missing " in the output already :) . I'll add the test parsing the output shortly.

@sijie sijie closed this in 3bff199 Mar 15, 2018
@merlimat merlimat deleted the prometheus-datasketches branch April 17, 2018 06:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants