Metricbeat - non incremental network usage metrics #2783
Comments
Marked as an enhancement request, but to be honest we're unlikely to implement this, because the general strategy is to do derivatives at query time via pipeline aggregations. So if we were to add derivatives in Metricbeat, they would only be a temporary solution. It's worth mentioning that Timelion is available in Kibana 4.2 as a plugin.
I don't understand how you can use incremental metrics. Please show me some real examples of how to use them, because I don't know how I can write triggers based on them, nor how I can present them on a single graph with other metrics like CPU usage and so on. Please point me to some docs or Kibana queries.
Here are the relevant docs for pipeline aggregations: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-derivative-aggregation.html I think Watcher supports pipeline aggregations, so you should be able to use that in watches. Unfortunately Kibana doesn't support that yet, so currently the only way of getting derivative graphs in Kibana is via Timelion. We're working on improving that. We realize this is quite inconvenient at the moment (we have the same issue in our sample dashboards), but we're hesitant to add a temporary solution when a much better and more complete one is on the horizon.
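For readers following along, a minimal sketch of what such a query-time derivative looks like with the aggregation from the linked docs. The index pattern, field name, and interval here are illustrative assumptions, not taken from this thread:

```json
POST metricbeat-*/_search
{
  "size": 0,
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "@timestamp", "interval": "1m" },
      "aggs": {
        "in_bytes": { "max": { "field": "system.network.in.bytes" } },
        "in_bytes_per_sec": {
          "derivative": { "buckets_path": "in_bytes", "unit": "1s" }
        }
      }
    }
  }
}
```

The derivative pipeline aggregation subtracts each bucket's max counter value from the previous bucket's, so the cumulative values stored by Metricbeat are turned into per-second rates at query time.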
It will always be inconvenient because Elasticsearch is not always used with Kibana. We are using it with Spark (DataFrames), and querying incremental data is also very painful in SQL. Is it possible to also submit deltas next to the incremental values?
+1
I use Grafana (derivative function) to handle these incrementing counters.
Your Metricbeat video from 7 Dec demonstrates again that we need deltas; otherwise you can only use the (limited) Timelion derivative as shown in the video.
@kaem2111 Thanks for joining the webinar. Can you elaborate on the "limited" part of the derivatives in Timelion?
+1
I want my metrics simple with filters and deltas :)
@heilaaks Thanks for the inputs.
An additional note from my side to this thread on why I like incremental metrics over precalculated deltas in most cases. I'm aware that this is not an answer to some of the above problems, but I thought it's worth sharing these thoughts. In case a data point is lost for whatever reason, the calculated total becomes incorrect. But based on the total, the correct derivatives can always be calculated. Also, data points can be removed over time for compression and the correct values can still be calculated. A very simple example: assume we have a data point of the total value every second, and we then lose the second of these data points for whatever reason. The remaining totals still allow the correct rates to be derived, whereas precalculated deltas could no longer be summed to the correct total.
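To make this concrete with purely illustrative numbers: suppose the per-second totals are 100, 250, 400 and 550 bytes. If the 250 sample is lost, the remaining totals 100, 400, 550 still yield the correct overall rate of 150 bytes/s between the first and last points. If instead only the deltas 100, 150, 150, 150 had been shipped and the second one were lost, the reconstructed total would be 400 instead of 550 and would stay wrong forever.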
Hi, regarding exports: we are pulling this data from Spark and do some data mining and predictions. Most advanced machine learning tools are Python based or Spark based, and there's no way to do it in Kibana. Using delta values allows us to preprocess and filter the data more easily. With incremental data we always have to pull everything, which is far too much in terms of MB. Right now we are reprocessing the incremental metrics in our custom code to provide deltas. Best, Mikolaj
Hello, for the metrics I can see two different use cases from our point of view: 1) in-house development and 2) customer-site troubleshooting. Due to various excuses, it is difficult to have a dedicated Elasticsearch cluster to store the metrics for case 1). Because the software changes every hour, I would need to store something for reference. For case 1), I also want to collect the data from large environments for offline analysis. The large environments are rare and expensive, so I do not want to hold on to them unnecessarily.

For case 2), the system is too complex for most end users to troubleshoot, and we would need to see the metrics in detail for at least 3-4 days. I would also like to align them visually to search for anomalies, for example between networking and disk IO. This means that the data has to be moved from the customer to development in a sane size that goes through a specific tool chain. Due to various excuses, we are not able to dump the Elasticsearch data from customers and import it back for Elastic analytics as of now. The Kibana PDF export is a nice addition, but it is a bit limited for these use cases. The flexibility of, for example, a Plotly graph is very nice for the offline analysis. I attached one example at the end.

Perhaps these points highlight limitations in our usage and competence more than limitations in the Elastic software. If we had everything nicely in place, running machine-learning-based analytics to observe anomalies from metrics with ELK and e.g. Spark to drive more sophisticated actions would be a better approach. We try to achieve more sophisticated solutions and a more suitable architecture, but we are not there yet. What I have been looking for in the metrics is:
@ruflin Your comment about incremental metrics and estimates sounds good for bullet 3 above. This might be good for cases with fast monitoring intervals (e.g. once per 1s) and automated analyses for anomalies. But our case is simple, and again I am too lazy to write, test and use differently behaving metrics.

For the Metricbeat filters, I just cannot get them to work. I would need the filters to select only a few services for which to send the CPU and memory metrics. This would improve the performance of the single-node Elasticsearch and Kibana that collects metrics from multiple hosts. I think the problem with the filters is that the process filter by default maps to the process name, which in the case of e.g. Java and Python is just 'java' or 'python'. We could do the separation from the username that is mapped to the service, or from the service command line. The Metricbeat processor actions do not include a filter that would 'match_event', which would allow me to simply select 'kafka,zookeeper,spark'. For drop_event I need to write a negated regexp over multiple fields (username and cmdline), which complicates the syntax. For example, I tried to create a filter that should drop all the events, but they still keep coming, so obviously I have misunderstood something :) I could use a full metricbeat.yml example for dummies with complex service filtering.
Plotly example:
There's a filtering example here. Note that you need to …
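As a rough illustration of the kind of processor configuration being discussed, here is a minimal sketch assuming the system process metricset and purely illustrative service names; the exact condition syntax depends on the Beats version in use:

```yaml
# metricbeat.yml (sketch): keep process events only for a few services,
# matching on the command line rather than the process name
processors:
  - drop_event:
      when:
        not:
          or:
            - contains:
                system.process.cmdline: "kafka"
            - contains:
                system.process.cmdline: "zookeeper"
            - contains:
                system.process.cmdline: "spark"
```

A condition like this drops every process event whose command line does not mention one of the listed services, which is roughly the 'match_event' behaviour asked for above.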
@mhainfarecom @heilaaks Thanks for sharing the insights from your side. @heilaaks Can you share some details on what you missed on the Kafka side so we can potentially add it? Consumer groups will be part of the next release if that is what you are missing ;-)
With my limited competence, and without verifying the statements: Kafka monitoring can generate a lot of data and can get complicated. What metrics to collect is also a matter of opinion and use case. I think the first problem with Kafka is to understand what is happening inside the Kafka streams and to get a view of how much Kafka is consuming, for example from the networking point of view. Kafka itself is very fast and I would not be worried about its performance (latencies or data rates) in the normal case. Because of these reasons, I would like to see from Kibana:
A few things off the top of my head, without checking the manuals, that may be helpful:
Because of the possible complications, I would first implement a basic set that concentrates on the topic- and cluster-level view rather than providing all the details. Make the module simple to use and able to answer basic analytics needs, and then improve. Thank you. I managed to get the filters to work. I wrote an example with tips and tricks in a Metricbeat forum posting.
@heilaaks That's quite an essay. I don't want to hijack this issue about network metrics from the system module's point of view for monitoring Kafka. It's still a worthwhile discussion though, as we just started with Kafka monitoring support. Unfortunately Kafka monitoring is quite a beast in comparison to other systems, and it will take some time to improve it further. Can you please open another GitHub issue or discuss topic for follow-up discussions?
+1 for this. Having to calculate the deltas ourselves precluded the use of Metricbeat. Our ES documents have to fit a specific pattern in order for our dashboard to display them correctly. I assume this is an issue because the underlying Go library gives you a cumulative value, which would require you to maintain the previous value somewhere. If someone were to put together a PR, would you prefer to always include the delta values, or to have some sort of configuration?
@falken Having elastic/kibana#9725 in Kibana 5.4 should make it much easier to deal with derivatives. See also #2783 (comment) for some more reasoning (and the rest of the thread). Can you share more details on your specific pattern and the use case?
When you try to monitor a rather busy web server, the system.network.in/out.bytes metric values periodically overflow as they reach MAX_LONG. The consequence is jagged charts capped at MAX_LONG if you try to visualize the raw data, and charts with negative values if you try to use derivatives. I find negative values in network usage particularly annoying. They are not only unaesthetic, but they also make legend values like avg, min or current totally useless. All of this could be avoided by getting delta values directly from Metricbeat.
Because that's the one true way... Everybody should use Emacs! Can we now please close this ticket...
@ruflin We ended up just accumulating this in Logstash before sending to Elasticsearch. It's not very elegant, but it works. We don't actually use Kibana for the affected portion of the app; we query Elasticsearch directly in order to determine if something is out of whack before alerting.
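For context, one way such a delta calculation in Logstash might look. This is a sketch only, not @falken's actual setup: the 5.x-era field paths and the added bytes_delta field are illustrative assumptions, and it needs a single pipeline worker so the per-host state stays consistent:

```
filter {
  ruby {
    # remember the previous counter value per host/interface (run with a single pipeline worker)
    init => "@last = {}"
    code => '
      key   = event.get("[beat][hostname]").to_s + "/" + event.get("[system][network][name]").to_s
      total = event.get("[system][network][in][bytes]")
      unless total.nil?
        prev = @last[key]
        # bytes_delta is a hypothetical field, not produced by Metricbeat itself;
        # skip the sample when the counter has wrapped or reset (total < prev)
        event.set("[system][network][in][bytes_delta]", total - prev) if prev && total >= prev
        @last[key] = total
      end
    '
  }
}
```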
@ruflin Do you still hold that incremental is superior, considering that the interface counters wrap around, and that to get a usable graph of what traffic is flowing through my servers right now I need this beast?
As discussed, that only works on some timeframes: change the period and we have to adjust the query. I'd much rather be investigating interesting patterns than figuring out how to display a simple traffic graph. If Elastic wants Metricbeat to be the de-facto choice for collecting metrics, then this needs to be way simpler.
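For readers who have not seen this pattern, the kind of expression involved looks roughly like the following Timelion sketch. The index pattern, field and split are assumptions, this is not the poster's actual query, and it still shows negative dips whenever a counter resets:

```
.es(index=metricbeat-*, metric=max:system.network.in.bytes, split=beat.hostname:5).derivative().scale_interval(1s).label('in bytes/s')
```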
@Nodens2k You poor man. Next time you buy a new machine, just get one with wider registers; as explained above, this is the 'one true way'. And besides, even if it wasn't, they are just not that elastic about this issue. Now regarding your new machine: even with a better machine, you might wish to make sure you're NOT running any of this thing called Javascript inside your Kibana [1], because Javascript can only correctly handle subtraction of 53-bit integers. Any more than that and it is going to lose precision. Now if you really need to use Kibana on a high-traffic server while analyzing your numbers with Javascript, just make sure to restart the server about once a month and you will probably be fine. Oh, and by the way... don't use any of those Sum aggregations on such metrics either. You need to first subtract those numbers and then sum them together, rather than the other way around. So now you know. [1] Some browsers have an option to disable Javascript altogether, so that should be safe as well.
Yes, I know. I was being sarcastic. I would like to excuse myself to the whole community and especially to the people from Elastic. They are providing us with a great service and a wonderful product, and more than that, they are even giving it away for free. They certainly did not deserve disrespect. It is my personal opinion that in this case, however, they seem to have failed to hear the feedback, so I spiced up the discussion a little bit -- apparently in the wrong way. No matter what, dealing with network metrics is a major pain, and I don't think that adding two delta fields would hurt: our metricbeat indices have thousands of fields and I can't see how 2 more would cause major damage in this department. Sorry again, and thanks to @falken for pointing out his disagreement with my ways.
@hilt86 @pkese Thanks for bringing up this issue again. Not directly implementing the change does not mean we are not listening. It's good to see people being passionate about the product. A few thoughts from my side since my last comment in March:
So far the main solution discussed is to add the non-incremental values to Metricbeat, which would be pretty easy to do from an engineering perspective. I don't worry about adding 2 fields; I worry more about how many other fields the same logic applies to. The network metrics mentioned here are definitely the two low-hanging fruits. Another solution is what @falken did: do the conversion in LS or ES. We could even provide ingest pipelines for that in Metricbeat. Other options of implementation?
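As an illustration of what adding the non-incremental values would amount to, here is a hypothetical event shape. The bytes_delta fields do not exist in Metricbeat; they are the fields being asked for, shown next to the existing cumulative counters:

```json
{
  "@timestamp": "2016-10-17T12:00:10.000Z",
  "system": {
    "network": {
      "name": "eth0",
      "in":  { "bytes": 1532411, "bytes_delta": 20480 },
      "out": { "bytes": 9821733, "bytes_delta": 81920 }
    }
  }
}
```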
If we went with incremental counters then in Kibana's TSVB …

@pkese The "sum" network traffic case is handled by the "Series Agg" TSVB special aggregation. You have to split the series by the hosts you're trying to aggregate together, do all the calculations (derivative) on each individual host metric, then aggregate those values back together (for each bucket).

@Nodens2k We added an aggregation to TSVB called "positive only" that you should use with derivatives; it will drop all those negative dips when the counters reset. Here is a video on using TSVB to visualize rates: https://youtu.be/CNR-4kZ6v_E

As far as moving to incremental counters goes, I'm 👎. From a visualization perspective I don't think it provides you with any advantages over the current counters we have today.
I would love incremental counters for the Metricbeat network and disk based values as we are stuck using Kibana 4.5 :( |
I switched from collectd and dockerbeats to Metricbeat in order to simplify and unify our collection of runtime/Docker stats. The motivation was to get in line with current development and simplify the deployment. We wanted to use the data to display basic graphs, i.e. CPU, memory and disk IO usage. The graphs are not rendered in Kibana, but within the web UI of the software we develop. Currently we face the problem of how to use the incremental disk IO metrics provided by the Metricbeat docker module to render graphs that outline the demands of the system as it evolves over time, i.e. read/write bytes per second or similar information. I guess the same will be true for other metrics as well. For this use case, I believe it would be an advantage to have a non-incremental / delta presentation of the metrics available as well. The argument for its inclusion would therefore be to enable easy adoption of Metricbeat over collectd and dockerbeats for non-expert users.
I am going to close this issue as it is unlikely that this will be implemented in Metricbeat. I am not going to follow up on the discussion because there are already enough arguments here for all kinds of opinions 🙂
This has just been added as a workaround and/or comparison. Metricbeat seems to stick to incremental values, so it probably makes sense to comply with that (elastic/beats#2783).
It would be very useful if Metricbeat also sent network usage as non-incremental values. It could be the delta between each execution. We should have these additional counters (or replace the existing ones) to be able to: