
high level metrics need #3889

Closed
ph opened this issue Sep 9, 2015 · 21 comments

@ph
Contributor

ph commented Sep 9, 2015

Currently there is no easy way to collect metrics from a running Logstash instance. Let's define our needs: do we need to build a custom solution, or can we leverage existing code? We have to keep in mind that we are currently rewriting important parts of the pipeline in pure Java to get more speed out of the JVM, so we need a hybrid solution that will grow and adapt with the code base.

Use cases of metrics collection in the current system.

We want to collect metrics from different parts of the system. I think we should try to have a low-level library usable in both scenarios; we can provide helper methods via metaprogramming, hiding away the complexity of the data collection. This would allow us to turn stats collection on or off for specific parts of the code.

  • From the pipeline itself
    • In/out of the queues
    • In/out of the filter method of different plugins
    • Metrics from conditional branching (detecting slow parts)
  • From the plugins
    • Webservice latencies
    • Heavy computations (grok)
    • Retry count
    • Error count
    • Internal buffering (like the multiline filter, lumberjack output and elasticsearch output)
    • Custom metrics defined by plugin authors
  • From the event itself?

Where do we want to expose the collected metrics?

I think we have two different needs:

Data collection for long-term analysis using external services

The system needs to be configurable in how metrics are exposed and at what frequency they are updated. For configuration, it would be nice if we could reuse how we currently configure plugins. We have to use a separate queue and a separate config to make sure that data collection and the normal operation of the pipeline don't interfere with each other.

The syntax could be something like this:

metrics {
  elasticsearch {}
  statsd {}
}

We are thinking about supporting these services first:

  • To Elasticsearch, using a format similar to Marvel
  • JMX
  • To other collecting services like statsd

Ad hoc requests via an API

These kinds of requests are a bit different. First, let me explain a few things: each time the system flushes the metrics, I see that as a snapshot. An API is simply a view over a window representing multiple snapshots, which may or may not be aggregated depending on the type of the metrics in the collected namespace.
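To make the snapshot/window idea concrete, here is a minimal Ruby sketch. The `Snapshot` and `MetricsWindow` names are hypothetical, invented for illustration: each flush records a snapshot, and an API query is just an aggregation over the snapshots inside a time window.

```ruby
# Hypothetical sketch: each metrics flush produces a Snapshot; an API
# request is a view over the snapshots that fall inside a time window.
Snapshot = Struct.new(:timestamp, :values)

class MetricsWindow
  def initialize
    @snapshots = []
  end

  def record(snapshot)
    @snapshots << snapshot
  end

  # Aggregate a counter across every snapshot in [from, to].
  def sum(key, from, to)
    @snapshots
      .select { |s| s.timestamp.between?(from, to) }
      .sum { |s| s.values.fetch(key, 0) }
  end
end

window = MetricsWindow.new
window.record(Snapshot.new(1, { "events.in" => 10 }))
window.record(Snapshot.new(2, { "events.in" => 15 }))
window.record(Snapshot.new(9, { "events.in" => 99 }))
window.sum("events.in", 0, 5) # => 25
```

Whether a window aggregates (counters) or samples (gauges) would depend on the metric type of the namespace, as described above.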

Discoverability

Every recorded metric key should be discoverable, either via a custom API endpoint or in the external services.

Metrics initial requirements

I think most of our needs are basic and we can answer our questions using standard primitives.

  • Counters (increment, decrement)
    How many events have we had since startup?
    How many times was a plugin restarted?
  • Timers
    What is the speed of this specific filter?
    Which conditional is the slowest?
  • Gauges
    How many events are in the queues?

Each of the metrics needs to be saved in a specific namespace (or key) that represents the collected data. Considering the snapshots and the window, maybe namespaces could have different local retention policies?

Collecting and aggregating metrics from the different endpoints should be done asynchronously so as not to block normal pipeline operation.
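The three primitives above could look something like this minimal Ruby sketch. The `Collector` class and its method names are hypothetical, not an actual Logstash API; namespaced string keys like `"pipeline.queue.in"` stand in for whatever key scheme is chosen.

```ruby
# Hypothetical low-level collector: counters, timers and gauges,
# each stored under a namespace key such as "pipeline.queue.in".
class Collector
  attr_reader :counters, :gauges, :timers

  def initialize
    @counters = Hash.new(0)
    @gauges   = {}
    @timers   = Hash.new { |h, k| h[k] = [] }
  end

  def increment(key, by = 1)
    @counters[key] += by
  end

  def decrement(key, by = 1)
    @counters[key] -= by
  end

  def gauge(key, value)
    @gauges[key] = value
  end

  # Time a block and record the elapsed seconds under `key`.
  def time(key)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    @timers[key] << Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    result
  end
end

metrics = Collector.new
metrics.increment("events.in")
metrics.gauge("pipeline.queue.size", 20)
metrics.time("filter.grok") { :filtered_event }
```

Helper methods like these are what metaprogramming could wrap around plugin code, so stats collection can be toggled on or off per namespace.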

First steps

Evaluate existing libraries; there is a lot of prior work done in Ruby or on the JVM for metrics collection. We should leverage these, or at least take inspiration from them if they don't support everything we want to do.

Initial list:

Ref: #2611

@jsvd
Member

jsvd commented Sep 9, 2015

I believe this issue could be further split into collection and outputting of metrics.

Collecting will deal with the necessary APIs between agent/pipeline/plugins/MetricsCollectorFactoryFactory and also how each metric-producing entity will generate metrics.

Outputting will then use the other side of MetricsCollectorFactoryFactory to push collected metrics into, for example, an extra stage of the pipeline. Also, which requirements and limitations does this stage have?

@ph
Contributor Author

ph commented Sep 9, 2015

+1 @jsvd I was thinking about splitting it; this is why it's called low level, to discuss the preliminary common stuff.

@colinsurprenant
Contributor

Quick idea: plugins could have a set of "standard" predefined metrics but could also expose "custom" metrics, which should be discoverable through an endpoint. For example, dumping metrics for a particular plugin would involve first dumping the standard metrics and then gathering the list of custom metrics and dumping them too. Thoughts?

@ph
Contributor Author

ph commented Sep 9, 2015

@colinsurprenant I have added custom metrics defined by plugin authors to the list, and endpoint discoverability, which is in reality a namespace/keys explorer.

@jsvd
Member

jsvd commented Sep 9, 2015

When I hear gathering and discoverability in regards to plugins announcing the existence of custom metrics to the collector, does that mean the metric collection model is polling rather than pushing (i.e. the collector periodically gathers metrics from the entities)?

@ph
Contributor Author

ph commented Sep 9, 2015

Concerning gathering and discoverability, I see it as a push model: if we never pushed metrics to a specific key or namespace (like SQS_ERROR_RATE), we never see it.
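In other words, discoverability falls out of the push model for free: the set of known keys is just the set of keys that have been pushed to. A tiny Ruby sketch (the `Registry` class is hypothetical, named here only for illustration):

```ruby
# Hypothetical push-model registry: a key such as "SQS_ERROR_RATE"
# only becomes discoverable once something has pushed a value to it.
class Registry
  def initialize
    @metrics = {}
  end

  def push(key, value)
    (@metrics[key] ||= []) << value
  end

  # Discoverability endpoint: list every key ever pushed to.
  def keys
    @metrics.keys
  end
end

registry = Registry.new
registry.keys # => [] -- "SQS_ERROR_RATE" is unknown until pushed
registry.push("SQS_ERROR_RATE", 0.02)
registry.keys # => ["SQS_ERROR_RATE"]
```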

@jsvd
Member

jsvd commented Sep 9, 2015

+1

Regarding the "Metrics initial requirements" I understood we were to do as little aggregation as possible, in order to leave those computations to the destination: statsd, elasticsearch, etc.

Anyway, some existing ruby libraries for metrics collection:
https://github.com/eric/metriks
https://github.com/johnewart/ruby-metrics
You know, for inspiration.

@ph
Contributor Author

ph commented Sep 9, 2015

The push model is easier to understand and reason about. Also, when a plugin is stuck, it will obviously affect the metrics collection for that specific plugin.

@ph
Contributor Author

ph commented Sep 9, 2015

@jsvd You are right on this; we are trying to avoid calculation/aggregation as much as we can in Logstash for performance reasons, and better tools exist for that, like ES :)

@guyboertje
Contributor

I raised the idea of a Supervisor running alongside the LS pipeline with @jsvd while discussing shutdown. We can have concern-based Supervisors, where an example concern is 'config debug', which implies a REST API based Supervisor, and 'ES supervision' implies an ES-integrated Supervisor that can take commands from ES and send data to ES.

There is no reason why one could not have multiple supervisors register with the pipeline. In terms of metrics, the pipeline and plugins could push metrics to all supervisors.

When talking of metrics and keys/values, I am reminded of the rich API that Redis has for updating values of datatypes (integer, array, set, hash). Could the Supervisor offer a similar API?

When talking about metrics for LS, I am reminded of the interesting work that William Louth is doing. I am not saying that we should adopt what he has done, rather that we could stream out fine-grained but richly qualified values such that the metrics can be synthesized, or the pipeline visualised, from the stream.

e.g.
pipeline-start, 234, 0, 567890123
where:
  • pipeline-start, i.e. the key, could be replaced by a well-known integer
  • 234 is the thread id
  • 0 is the event id (ruby object_id)
  • 567890123 is the JVM System nanosecond timestamp

Other keys might include input-data-receive, input-event-enqueue and filter-event-dequeue

A whole bunch of metrics can then be derived: the size of any queue, input throughput, and state when crashed.
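A sketch of an emitter for this tuple layout, in Ruby. The `PerfStream` class name is hypothetical; the tuple fields follow the key/thread-id/event-id/nanosecond-timestamp layout described above.

```ruby
# Hypothetical perf-stream emitter following the tuple layout above:
# [key, thread id, event id, nanosecond timestamp].
class PerfStream
  attr_reader :sink

  def initialize(sink = [])
    @sink = sink
  end

  def emit(key, thread_id, event_id)
    ts = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
    @sink << [key, thread_id, event_id, ts]
  end
end

stream = PerfStream.new
stream.emit("pipeline-start", 234, 0)
stream.emit("input-event-enqueue", 234, 0)
# Queue sizes, throughput, etc. can then be derived downstream by
# pairing enqueue/dequeue tuples, leaving the pipeline itself
# free of any aggregation work.
```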

@ph
Contributor Author

ph commented Sep 9, 2015

@guyboertje Not sure I see the benefit of sending metrics to multiple supervisors? I was looking more at sending metrics to a centralized broker that could use multiple Reporters or backends and dispatch to them.

Concerning the format, I think your idea is worth checking!

But is the event_id really meaningful here?

@ph
Contributor Author

ph commented Sep 9, 2015

@guyboertje thanks for making me discover William's work, there is a lot of inspiration in it.

@ph
Contributor Author

ph commented Sep 9, 2015

Never mind, I have answered my own question concerning the id: I think we should use some kind of UUID instead of the Ruby object id. If I understand your comment, this would allow us to trace a specific event and its performance, which could let the user find slower events and change the pipeline to get better performance.

@guyboertje
Contributor

@ph re: multiple controllers vs. a broker with reporters: same concept, no argument from me; only to establish that we should not try to make one Process Controller/Reporter cover all use cases.

@guyboertje
Contributor

@ph re: event id. Upon reflection, a session UUID is needed to group all metrics; whether this is added by the receiver (Process Controller/broker) or by the pipeline is open to question. To answer your question: yes, event_id is to enable individual event tracing. Adding a UUID to the event would allow correlation between the event itself (in ES, for example) and its related metrics.

@guyboertje
Contributor

Initially we could leverage Colin's mmap work to persist the 'perf stream'. <- Possible official name to avoid the overloaded term 'metrics' :-)

@guyboertje
Contributor

@ph - yeah, William Louth's work is quite mind-blowing.

@guyboertje
Contributor

I have said before that I see current LS as an open-loop system. Starting with these proposals, we can close the loop (via, initially, a human). I see the Process Controller relaying control commands in and sending perf metrics out. The human may not necessarily interact with the Process Controller directly, but rather through a Client that translates perf metrics for human consumption and control input for dispatch to the Process Controller, especially if the data store is ES as Shay suggests.

@ph
Contributor Author

ph commented Sep 10, 2015

I've created issue #3892 yesterday following your initial comment; since we already have a metadata field, we could also add the plugin id as metadata to the event.

@ph
Contributor Author

ph commented Sep 10, 2015

Concerning metrics, I think perf stream is probably a better name if we trace the event end to end through our system.

@guyboertje
Contributor

Renamed Supervisor to Process Controller in my comments

@ph ph changed the title Metrics: Low level metrics collection Metrics: high level metrics need Sep 10, 2015
@ph ph changed the title Metrics: high level metrics need high level metrics need Sep 11, 2015
@ph ph closed this as completed Sep 11, 2015
4 participants