Collect discovery metrics in a generic fashion #20
Comments
This is probably clearer, as it shows the poll time of the instance. Backend times on the previous graph are minimal because I'm doing bulk imports; previously they were a major source of the poll latency. This also highlights that if you run multiple orchestrator servers across datacentres, performance can change quite significantly by moving the monitoring server. That is not so apparent here, but was more so with bulk inserts disabled, where backend latency was the major part of the time taken to poll the instance.
metrics collections: please collect any metric you see fit to collect. Everything goes via the metrics registry. As for making those metrics visible: please consider outbrain/orchestrator#255. That PR exposes all registered metrics, automatically, as expvar-like JSON.
Shlomi, this does not work. I want to be able to obtain derivative values of individual discovery metrics over a time period. Some of those, like the mean, can be computed from aggregates, but others, like the percentile values, median or max, can not: you need the raw values to calculate them.

Also, in this change I am likely to split the discovery time into the time taken to access the polled instance, the time taken to access the orchestrator backend, and the total time. These values won't always agree, as connect-timeout latency can influence things, and unhealthy boxes tend to take longer to respond, or not respond at all, thus shifting the normal range of metrics you might expect to see. Latency between datacentres can also affect these numbers: ping times may be sub-millisecond within a datacentre and multiple tens of milliseconds outside of it. With a large number of instances this can noticeably affect the way the backend orchestrator database behaves, and it is one reason for the buffered-writes patch provided previously. However, you can only see this if you measure it, and those measurements are not in the current orchestrator code (but are in mine). This may not seem important, but it becomes increasingly so as the number of monitored MySQL servers grows.

I do not mind setting up some derived values which may make sense for others, and that is indeed my plan (once the raw collection is working), as that will at least provide users something useful that requires little work to collect. However, even then I may want to generate other metrics from the raw values. I may have an interest in using an external script to generate "per environment", "per dc", "per rack" or other such metrics which make no sense for anyone else. The raw metrics give me absolute freedom to do whatever I need without changing the orchestrator code: I get the last N seconds of discovery metrics in "raw form", I request that period, and orchestrator is configured to retain values for a longer/configurable period, after which it periodically purges data in the background automatically.

With my change I can also collect the metrics over the network, whereas my current scripts have to run locally to tail the orchestrator log file. So I understand your desire to use the metrics library, and it is helpful, but it does not fully cover my use case. I really want to stop scraping the orchestrator log file, which grows by huge amounts and is required by my current scripts even if I never look at the data. That noise prevents me from seeing interesting, useful or important information, which is easy to miss at the moment. With the type of change I propose, logging can stay quiet and I can pull the raw data, and generate the derivative information I currently use, directly from the active instance as needed. Hope that makes sense.
I think this issue can be closed now. Several PRs I've provided have added discovery metrics via API calls, so this is now resolved. Closing.
I have been running orchestrator for some time with a custom patch which generates discovery metrics: information, for each poll, on the time it took to get the status of the MySQL server being checked. The information collected showed how long was spent doing "database calls" on the server being discovered/polled and also on the orchestrator backend database.
I noticed that when orchestrator was polling a large number of MySQL servers, the metrics could vary significantly depending on the location of the orchestrator server relative to the orchestrator backend database. This information has been used to identify and fix several issues, and also to provide a bulk-import mechanism, all of which has been incorporated into orchestrator via pull requests.
However, the metric-collection patches have not been provided as pull requests: they were rather ugly, and I had not come up with a mechanism which seemed generic enough to be used by anyone.
This issue is to discuss my ideas on solving this properly.
This comprises two parts:
The first part has been done against outbrain/orchestrator code so needs to be adapted against github/orchestrator code. That should be relatively straightforward.
For the second part I'd like to generate two API endpoints:
Example aggregated values I use are:
The example above shows two different orchestrator systems monitoring some servers. As you can see, a small spike shows a sudden, unexpected change in metric times (probably not important here). Metric times differ because the monitoring orchestrator servers are located in different datacentres.
I think that providing the information in the way described would make it easy for any user to collect the values and incorporate them into their own monitoring or graphing systems.
More specific details can be discussed, but this issue is to discuss this change, which I propose to provide as a pull request in the near future.