Collect discovery metrics in a generic fashion #20

Closed
sjmudd opened this issue Dec 27, 2016 · 4 comments
Comments

@sjmudd
Collaborator

sjmudd commented Dec 27, 2016

I have been running orchestrator for some time with a custom patch which generates discovery metrics: for each poll, the time it took to get the status of the MySQL server being checked. The information collected showed how long was spent doing database calls against the server being discovered/polled and against the orchestrator backend database.

I noticed that when orchestrator was polling a large number of MySQL servers, the metrics could vary significantly depending on the location of the orchestrator server relative to the orchestrator backend database. This information has been used to identify and fix several issues and to provide a bulk import mechanism, all of which has been incorporated into orchestrator via pull requests.

However, the metric collection patches have not been provided as pull requests as they were rather ugly and I had not come up with a mechanism which seemed generic enough to be used by anyone.

This issue is to discuss my ideas on solving this properly.

This comprises two parts:

  • collecting the metrics for each discovery
  • making them visible to external users so they can add them to their own monitoring systems

The first part has been done against the outbrain/orchestrator code, so it needs to be adapted to the github/orchestrator code. That should be relatively straightforward.

For the second part I'd like to provide two API endpoints:

  • a JSON array containing the raw data for each discovery collected over the last period P (say 60-120 seconds, configurable). This would contain timestamp / hostname:port / metric values. External users would then need to generate "aggregate metrics" for each time period monitored; they can derive any metrics they want from these raw values. A sketch of such a raw record follows below.
  • a JSON structure containing a pre-defined set of aggregated data based on those raw values, which could be used directly. This simplifies collection for most users, as the aggregations can be consumed without doing any calculations.
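To make the raw endpoint concrete, here is a minimal sketch in Go of the kind of record that JSON array could contain. The type and field names are illustrative only, not the eventual implementation:

```go
package discovery

import (
	"encoding/json"
	"time"
)

// Metric is a hypothetical record for a single discovery poll.
type Metric struct {
	Timestamp       time.Time     `json:"timestamp"`
	InstanceKey     string        `json:"hostname_port"`       // e.g. "db1.example.com:3306"
	TotalLatency    time.Duration `json:"total_latency_ns"`    // total discovery time
	InstanceLatency time.Duration `json:"instance_latency_ns"` // talking to the discovered host
	BackendLatency  time.Duration `json:"backend_latency_ns"`  // talking to the orchestrator backend
	Success         bool          `json:"success"`
}

// rawMetricsJSON renders the collected raw metrics as the API response body.
func rawMetricsJSON(raw []Metric) ([]byte, error) {
	return json.Marshal(raw)
}
```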

Example aggregated values I use are (a sketch of deriving them follows this list):

  • success/failure counts
  • median / 95th percentile latency of the total discovery time
  • median / 95th percentile latency talking to the discovered host
  • median / 95th percentile latency talking to the orchestrator backend
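Since these aggregations require the raw values (a counter alone cannot produce a median or a 95th percentile), a minimal sketch of the calculation, using simple nearest-rank percentiles over the raw latencies, might look like this:

```go
package discovery

import (
	"sort"
	"time"
)

// percentile returns the value at quantile q (0..1) of an ascending-sorted
// slice, using nearest-rank selection.
func percentile(sorted []time.Duration, q float64) time.Duration {
	if len(sorted) == 0 {
		return 0
	}
	return sorted[int(q*float64(len(sorted)-1))]
}

// summarize derives median and 95th percentile from raw per-poll latencies.
func summarize(latencies []time.Duration) (median, p95 time.Duration) {
	sorted := append([]time.Duration(nil), latencies...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return percentile(sorted, 0.50), percentile(sorted, 0.95)
}
```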

[screenshot: 2016-12-27 at 10:46:35]

The example above shows two different orchestrator systems monitoring some servers. As you can see, a small spike shows a sudden, unexpected change in metric times (probably not important here). The metric times differ because the monitoring orchestrator servers are located in different datacentres.

I think that providing the information in the way described would make it easy for any user to collect the values and incorporate them into their own monitoring or graphing systems.

More specific details can be discussed, but this issue is to discuss this change, which I propose to provide as a pull request in the near future.

@sjmudd
Collaborator Author

sjmudd commented Dec 27, 2016

[screenshot: 2016-12-27 at 11:49:00]

This is probably clearer as it shows the poll time of the instance. Backend times on the previous graph are minimal as I'm doing bulk imports. Previously they were a major source of the poll latency.

This also highlights that if you run multiple orchestrator servers across datacentres, performance can change quite significantly depending on where the monitoring server is located. That is not so apparent here, but it was more so with bulk inserts disabled, where the backend latency was the major part of the time to poll the instance.

@shlomi-noach
Collaborator

Metrics collection:

please collect any metric you see fit. Everything goes via the metrics library, as in this code.

The metrics library should be enough for you to collect metrics of any form; declare upfront whether each is a gauge or a counter, and you should be good.
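For illustration, registering a counter and a gauge with the go-metrics library looks roughly like this (a sketch only; the metric names are placeholders, not actual orchestrator metric names):

```go
package discovery

import "github.com/rcrowley/go-metrics"

// Declare upfront whether each metric is a counter or a gauge, then register
// it once. Names here are placeholders.
var (
	discoveryAttempts = metrics.NewCounter()
	discoveryLatency  = metrics.NewGauge()
)

func init() {
	metrics.Register("discoveries.attempt", discoveryAttempts)
	metrics.Register("discoveries.latency", discoveryLatency)
}

// recordDiscovery would be called from the discovery code path.
func recordDiscovery(latencyNanos int64) {
	discoveryAttempts.Inc(1)
	discoveryLatency.Update(latencyNanos)
}
```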

As for making those metrics visible:

Please consider outbrain/orchestrator#255. This PR exposes all registered metrics, automatically, as expvar-like JSON via /web/debug/metrics. Admittedly, it should be served at /api/debug/metrics -- but please do check out this path on your own deployment and let me know if this makes sense to you!
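Conceptually, that endpoint amounts to dumping the metrics registry as JSON, roughly as in this sketch (using go-metrics' JSON writer; the path and port are placeholders, and this is not the PR's actual code):

```go
package main

import (
	"log"
	"net/http"

	"github.com/rcrowley/go-metrics"
)

// debugMetricsHandler writes every registered metric as one JSON document.
func debugMetricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	metrics.WriteJSONOnce(metrics.DefaultRegistry, w)
}

func main() {
	http.HandleFunc("/api/debug/metrics", debugMetricsHandler)
	log.Fatal(http.ListenAndServe(":3000", nil))
}
```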

@sjmudd
Collaborator Author

sjmudd commented Dec 28, 2016

Shlomi, this does not work for my use case.

I want to be able to obtain derived values from the individual discovery metrics over a time period. Some of those, like the mean, can be computed from counters, but others, like percentile values, the median or the max, cannot: you need the raw values to calculate them. Also, in this change I am likely to split the discovery time into the time taken to access the polled instance, the time taken to access the orchestrator backend, and the total time. These values won't all agree, as connect timeout latency can influence things, and unhealthy boxes tend to take longer to respond, or not respond at all, thus shifting the normal range of metrics you might expect to see.

Latency between datacentres can also affect these numbers: you may have sub-millisecond ping times within a datacentre and multiple tens of milliseconds outside of it. With a large number of instances this can noticeably affect the way the backend orchestrator database behaves, and it is one reason for the buffered writes patch that has been provided previously. However, you can only see this if you measure it, and those measurements are not in the current orchestrator code (but are in mine). This may not seem important, but it becomes increasingly so as the number of monitored MySQL servers grows.

I do not mind setting up some derived values which may make sense for others, and that is indeed my plan (once the raw collection is working), as that will at least provide users with something useful that requires little work to collect.

However, even then I may want to generate other metrics from the raw values. I may have an interest in using an external script to generate "per environment", "per datacentre", "per rack" or other such metrics which make no sense for anyone else. The raw metrics give me absolute freedom to do whatever I need without changing the orchestrator code: I request the last N seconds of discovery metrics in raw form, while orchestrator is configured to retain values for a longer, configurable period, after which it periodically purges the data in the background automatically.
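To illustrate what I mean by raw collection with a configurable retention period, here is a minimal sketch (all names hypothetical) of an in-memory store that serves the last N seconds on request and purges older samples in the background:

```go
package discovery

import (
	"sync"
	"time"
)

// sample is a hypothetical raw discovery measurement.
type sample struct {
	When    time.Time
	Latency time.Duration
}

// Collection retains raw samples for a configurable period and purges
// older ones in the background.
type Collection struct {
	sync.Mutex
	samples   []sample
	retention time.Duration
}

func NewCollection(retention time.Duration) *Collection {
	c := &Collection{retention: retention}
	go func() {
		for range time.Tick(retention) {
			c.purge()
		}
	}()
	return c
}

func (c *Collection) Append(s sample) {
	c.Lock()
	defer c.Unlock()
	c.samples = append(c.samples, s)
}

// Since returns the raw samples newer than the requested period, e.g. the
// last 60-120 seconds asked for via the API.
func (c *Collection) Since(period time.Duration) []sample {
	c.Lock()
	defer c.Unlock()
	cutoff := time.Now().Add(-period)
	recent := []sample{}
	for _, s := range c.samples {
		if s.When.After(cutoff) {
			recent = append(recent, s)
		}
	}
	return recent
}

// purge drops samples older than the retention period, in place.
func (c *Collection) purge() {
	c.Lock()
	defer c.Unlock()
	cutoff := time.Now().Add(-c.retention)
	kept := c.samples[:0]
	for _, s := range c.samples {
		if s.When.After(cutoff) {
			kept = append(kept, s)
		}
	}
	c.samples = kept
}
```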

With my change I can also collect the metrics over the network, in contrast to my current scripts, which have to run locally to tail the orchestrator log file.

So I understand your desire to use the metrics library, and it is helpful, but it does not cover my use case fully. I really want to stop scraping the orchestrator log file, which grows enormously and is required by the scripts I am currently using even if I never look at the data. That noise hides the interesting, useful or important information, which is easy to miss at the moment.

With the type of change I propose, logging can stay quiet, and I can pull the raw data from the active instance and generate the derived information I currently use, as needed.

Hope that makes sense.

@sjmudd
Collaborator Author

sjmudd commented Mar 25, 2017

I think this issue can be closed now. Several PRs I've provided have added discovery metrics via API calls, so this is now resolved. Closing.

sjmudd closed this as completed Mar 25, 2017