
high level metrics need #3889

Closed
ph opened this issue Sep 9, 2015 · 21 comments

@ph
Contributor

ph commented Sep 9, 2015

Currently there is no easy way to collect metrics from a running Logstash instance. Let's define our needs: do we need to build a custom solution, or can we leverage existing code? We have to keep in mind that we are currently rewriting important parts of the pipeline in pure Java to get more speed out of the JVM, so we need a hybrid solution that will grow and adapt with the code base.

Use cases of metrics collection in the current system.

We want to collect metrics from different parts of the system. I think we should try to have a low-level library usable in both scenarios; we can provide helper methods via metaprogramming, hiding away the complexity of the data collection. This would allow us to turn stats collection on or off for specific parts of the code.

  • From the pipeline itself
    • In/out of the queues
    • In/out of the filter method of different plugins
    • Metrics from conditional branching (detecting slow parts)
  • From the plugins
    • Webservice latencies
    • Heavy computations (grok)
    • Retry count
    • Error count
    • Internal buffering (like the multiline filter, lumberjack output and elasticsearch output)
    • Custom metrics defined by plugin authors
  • From the event itself?

Where do we want to expose the collected metrics?

I think we have two different needs:

Data collection for long-term analysis using external services

The system needs to be configurable in how metrics are exposed and at what frequency they are updated. For configuration, it would be nice if we could reuse how we currently configure plugins. We have to use a separate queue and a separate config to make sure that data collection and the normal operation of the pipeline don't interfere with each other.

The syntax could be something like this:

metrics {
  elasticsearch {}
  statsd {}
}

We are thinking about supporting these services first:

  • To Elasticsearch, using a format similar to Marvel
  • JMX
  • To other collecting services like statsd

Ad hoc requests via an API

These kinds of requests are a bit different. First, let me explain a few things: each time the system flushes the metrics, I see that as a snapshot. An API is simply a view over a window representing multiple snapshots, which may or may not be aggregated depending on the type of the metrics in the collected namespace.
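To make the snapshot/window idea concrete, here is a minimal Ruby sketch. The `Snapshot` and `MetricsWindow` names are hypothetical, invented for illustration: each flush records a snapshot, and an API query is just an aggregation over the snapshots inside a time window.

```ruby
# Hypothetical sketch: each metrics flush produces a Snapshot; an API
# request is a view over the snapshots that fall inside a time window.
Snapshot = Struct.new(:timestamp, :values)

class MetricsWindow
  def initialize
    @snapshots = []
  end

  def record(snapshot)
    @snapshots << snapshot
  end

  # Aggregate a counter across every snapshot in [from, to].
  def sum(key, from, to)
    @snapshots
      .select { |s| s.timestamp.between?(from, to) }
      .sum { |s| s.values.fetch(key, 0) }
  end
end

window = MetricsWindow.new
window.record(Snapshot.new(1, { "events.in" => 10 }))
window.record(Snapshot.new(2, { "events.in" => 15 }))
window.record(Snapshot.new(9, { "events.in" => 99 }))
window.sum("events.in", 0, 5) # => 25
```

Whether a window aggregates (counters) or samples (gauges) would depend on the metric type of the namespace, as described above.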

Discoverability

Every recorded metric key should be discoverable, either via a custom API endpoint or in the external services.

Metrics initial requirements

I think most of our needs are basic and we can answer our questions using standard primitives.

  • Counters (increment, decrement)
    How many events have we had since startup?
    How many times was a plugin restarted?
  • Timers
    What is the speed of this specific filter?
    Which conditional is the slowest?
  • Gauges
    How many events are in the queues?

Each of the metrics needs to be saved in a specific namespace (or key) that represents the collected data. Considering the snapshots and the window, maybe namespaces could have different local retention policies?

Collecting and aggregating metrics from the different endpoints should be done asynchronously so as not to block normal pipeline operation.
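The three primitives above could look something like this minimal Ruby sketch. The `Collector` class and its method names are hypothetical, not an actual Logstash API; namespaced string keys like `"pipeline.queue.in"` stand in for whatever key scheme is chosen.

```ruby
# Hypothetical low-level collector: counters, timers and gauges,
# each stored under a namespace key such as "pipeline.queue.in".
class Collector
  attr_reader :counters, :gauges, :timers

  def initialize
    @counters = Hash.new(0)
    @gauges   = {}
    @timers   = Hash.new { |h, k| h[k] = [] }
  end

  def increment(key, by = 1)
    @counters[key] += by
  end

  def decrement(key, by = 1)
    @counters[key] -= by
  end

  def gauge(key, value)
    @gauges[key] = value
  end

  # Time a block and record the elapsed seconds under `key`.
  def time(key)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    @timers[key] << Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
    result
  end
end

metrics = Collector.new
metrics.increment("events.in")
metrics.gauge("pipeline.queue.size", 20)
metrics.time("filter.grok") { :filtered_event }
```

Helper methods like these are what metaprogramming could wrap around plugin code, so stats collection can be toggled on or off per namespace.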

First steps

Evaluate existing libraries; there is a lot of prior work done in Ruby or on the JVM for metrics collection. We should leverage these, or at least take inspiration from them if they don't support everything we want to do.

Initial list:

Ref: #2611

@jsvd
Member

jsvd commented Sep 9, 2015

I believe this issue could be further split into collection and outputting of metrics.

Collecting will deal with the necessary APIs between agent/pipeline/plugins/MetricsCollectorFactoryFactory and also how each metric-producing entity will generate metrics.

Outputting will then use the other side of MetricsCollectorFactoryFactory to push collected metrics into, for example, an extra stage of the pipeline. Also, which requirements and limitations does this stage have?

@ph
Contributor Author

ph commented Sep 9, 2015

+1 @jsvd I was thinking about splitting it; this is why it's called low level, to discuss the preliminary common stuff.

@colinsurprenant
Contributor

Quick idea: plugins could have a set of "standard" predefined metrics but could also expose "custom" metrics, which should be discoverable through an endpoint. For example, dumping metrics for a particular plugin would involve first dumping the standard metrics and then gathering the list of custom metrics and dumping them too. Thoughts?

@ph
Contributor Author

ph commented Sep 9, 2015

@colinsurprenant I have added custom metrics defined by plugin authors to the list, and endpoint discoverability, which is in reality a namespace/keys explorer.

@jsvd
Member

jsvd commented Sep 9, 2015

When I hear gathering and discoverability in regards to plugins announcing the existence of custom metrics to the collector, does that mean the metric collection model is polling rather than pushing (i.e. the collector periodically gathers metrics from the entities)?

@ph
Contributor Author

ph commented Sep 9, 2015

Concerning gathering and discoverability, I see it as a push model: if we never pushed metrics to a specific key or namespace (like SQS_ERROR_RATE), we never see it.
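In other words, discoverability falls out of the push model for free: the set of known keys is just the set of keys that have been pushed to. A tiny Ruby sketch (the `Registry` class is hypothetical, named here only for illustration):

```ruby
# Hypothetical push-model registry: a key such as "SQS_ERROR_RATE"
# only becomes discoverable once something has pushed a value to it.
class Registry
  def initialize
    @metrics = {}
  end

  def push(key, value)
    (@metrics[key] ||= []) << value
  end

  # Discoverability endpoint: list every key ever pushed to.
  def keys
    @metrics.keys
  end
end

registry = Registry.new
registry.keys # => [] -- "SQS_ERROR_RATE" is unknown until pushed
registry.push("SQS_ERROR_RATE", 0.02)
registry.keys # => ["SQS_ERROR_RATE"]
```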

@jsvd
Member

jsvd commented Sep 9, 2015

+1

Regarding the "Metrics initial requirements" I understood we were to do as little aggregation as possible, in order to leave those computations to the destination: statsd, elasticsearch, etc.

Anyway, some existing ruby libraries for metrics collection:
https://github.com/eric/metriks
https://github.com/johnewart/ruby-metrics
You know, for inspiration.

@ph
Contributor Author

ph commented Sep 9, 2015

The push model is easier to understand and reason about. Also, when a plugin is stuck, it will obviously affect the metrics collection for that specific plugin.

@ph
Contributor Author

ph commented Sep 9, 2015

@jsvd You are right on this; we are trying to avoid calculation/aggregation as much as we can in Logstash for performance reasons, and better tools exist for that, like ES :)

@guyboertje
Contributor

I raised the idea of a Supervisor running alongside the LS pipeline with @jsvd while discussing shutdown. We can have concern-based Supervisors, where an example concern is 'config debug', which implies a REST API based Supervisor, and 'ES supervision' implies an ES-integrated Supervisor that can take commands from ES and send data to ES.

There is no reason why one could not have multiple supervisors register with the pipeline. In terms of metrics, the pipeline and plugins could push metrics to all supervisors.

When talking of metrics and keys/values, I am reminded of the rich API that Redis has for updating values of datatypes (integer, array, set, hash). Could the Supervisor offer a similar API?

When talking about metrics for LS, I am reminded of the interesting work that William Louth is doing. I am not saying that we should adopt what he has done, rather that we could stream out fine-grained but richly qualified values such that the metrics can be synthesized, or the pipeline visualised, from the stream.

e.g.
pipeline-start, 234, 0, 567890123
where:
  • pipeline-start, i.e. the key, could be replaced by a well-known integer
  • 234 is the thread id
  • 0 is the event id (ruby object_id)
  • 567890123 is the JVM System nanosecond timestamp

Other keys might include input-data-receive, input-event-enqueue and filter-event-dequeue

A whole bunch of metrics can then be derived: the size of any queue, input throughput, and state when crashed.
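A sketch of an emitter for this tuple layout, in Ruby. The `PerfStream` class name is hypothetical; the tuple fields follow the key/thread-id/event-id/nanosecond-timestamp layout described above.

```ruby
# Hypothetical perf-stream emitter following the tuple layout above:
# [key, thread id, event id, nanosecond timestamp].
class PerfStream
  attr_reader :sink

  def initialize(sink = [])
    @sink = sink
  end

  def emit(key, thread_id, event_id)
    ts = Process.clock_gettime(Process::CLOCK_MONOTONIC, :nanosecond)
    @sink << [key, thread_id, event_id, ts]
  end
end

stream = PerfStream.new
stream.emit("pipeline-start", 234, 0)
stream.emit("input-event-enqueue", 234, 0)
# Queue sizes, throughput, etc. can then be derived downstream by
# pairing enqueue/dequeue tuples, leaving the pipeline itself
# free of any aggregation work.
```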

@ph
Contributor Author

ph commented Sep 9, 2015

@guyboertje Not sure I see the benefit of sending metrics to multiple supervisors? I was looking more at sending metrics to a centralized broker that could use multiple Reporters or backends and dispatch to them.

Concerning the format, I think your idea is worth checking!

But is the event_id really meaningful here?

@ph
Contributor Author

ph commented Sep 9, 2015

@guyboertje thanks for making me discover William's work, there is a lot of inspiration in it.

@ph
Contributor Author

ph commented Sep 9, 2015

Never mind, I have answered my own question concerning the id: I think we should use some kind of UUID instead of the Ruby object id. If I understand your comment, this would allow us to trace a specific event and its performance, which could let the user find slower events and change the pipeline to get better performance.

@guyboertje
Contributor

@ph re: multiple controllers vs. a broker with reporters: same concept, no argument from me; only to establish that we should not try to make one Process Controller/Reporter cover all use cases.

@guyboertje
Contributor

@ph re: event id. Upon reflection, a session UUID is needed to group all metrics; whether this is added by the receiver (Process Controller/broker) or by the pipeline is open to question. To answer your question: yes, event_id is to enable individual event tracing. Adding a UUID to the event would allow correlation between the event itself (in ES, for example) and its related metrics.

@guyboertje
Contributor

Initially we could leverage Colin's mmap work to persist the 'perf stream'. <- Possible official name to avoid the overloaded term 'metrics' :-)

@guyboertje
Contributor

@ph - yeah, William Louth's work is quite mind-blowing.

@guyboertje
Contributor

I have said before that I see current LS as an open-loop system. Starting with these proposals, we can close the loop (via, initially, a human). I see the Process Controller relaying control commands in and sending perf metrics out. The human may not necessarily interact with the Process Controller directly, but rather through a Client that translates perf metrics for human consumption and control input for dispatch to the Process Controller, especially if the data store is ES as Shay suggests.

@ph
Contributor Author

ph commented Sep 10, 2015

I've created issue #3892 yesterday following your initial comment; since we already have a metadata field, we could also add the plugin id as metadata to the event.

@ph
Contributor Author

ph commented Sep 10, 2015

Concerning metrics, I think perf stream is probably a better name if we trace the event end to end through our system.

@guyboertje
Contributor

Renamed Supervisor to Process Controller in my comments

@ph ph changed the title Metrics: Low level metrics collection Metrics: high level metrics need Sep 10, 2015
@ph ph changed the title Metrics: high level metrics need high level metrics need Sep 11, 2015
@ph ph closed this as completed Sep 11, 2015
4 participants