statsd-emitter #2410

Merged
merged 1 commit into from
Apr 29, 2016

Conversation

michaelschiff

After spending some time trying to emit metrics directly from Druid to Ganglia, it became clear that Druid needs some kind of pre-aggregation step before loading metrics into such a store. This type of aggregation seemed beyond the scope of a Metric Emitter (which should really just emit metrics, not do fancy, configurable aggregations), and is exactly what StatsD is designed for.

I am still testing this emitter, but would like to open the PR now so that I can get some eyes from the community on what I am doing.

@michaelschiff michaelschiff force-pushed the statsd-emitter branch 6 times, most recently from 598802c to fb1767e Compare February 6, 2016 00:26
</dependency>
</dependencies>

</project>

new line

@b-slim

b-slim commented Feb 8, 2016

This needs more unit tests, such as serialization/deserialization (ser/deser) tests.

private final Map<String, StatsDEmitterConfig.MetricType> metricTypes;

public StatsDEmitter(StatsDEmitterConfig config) {
  statsd = new NonBlockingStatsDClient(config.getPrefix(), config.getHostname(), config.getPort());

@b-slim

b-slim commented Feb 8, 2016

@michaelschiff again, my main concern is that StatsD will aggregate a huge number of metrics with very high cardinality, probably on heap. This can blow up the Druid process. I think you need a way to filter out high-cardinality dimensions such as query id.

@michaelschiff

@b-slim this is StatsD's stated purpose. Also, no aggregation is done in the emitter, so I am not sure why you are concerned this will "blow away the druid process". As we discussed (at great length) in the Ganglia pull request, this type of filtering does not really solve its intended problem.

My reasoning regarding the cardinality problem was as follows:

  • Filter out any metric that has a high-cardinality dimension (as in the Graphite emitter). This works, but you throw away many useful metrics; for example, Historical's query/wait/time and query/segment/time are pretty much off limits because of their high-cardinality dimension (segment). I see that the JSON file lets you specify some number of dimensions to keep. When you do not include the full set of dimensions (as in many of the default cases), how are metrics from different dimension values aggregated? Is Graphite doing this for you? If it is simply plotting them all on the same time series, you will see some strange things for certain metrics.
  • Aggregate across these dimensions in the emitter. This works in principle (and will fit in memory with no problem so long as the number of base metric names stays reasonably small, which it already is), but you really want to be able to configure which aggregations you do and which metric prefixes you aggregate across. This seems like it's going to be a lot of configuration...
  • Use StatsD, a metric aggregator and forwarder for Graphite and Ganglia. With a system like this, we can emit metrics at their full granularity from Druid but still get all of the fancy aggregation we want from a system purpose-built for this problem.
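
To make the third option concrete, here is a minimal sketch of what emitting at full granularity could look like on the wire. It assumes the plain StatsD line protocol (`<bucket>:<value>|<type>`, with `c`, `ms` and `g` for counters, timers and gauges). The class name, the `bucket` helper and the way dimension values are joined into the bucket name are hypothetical and are not taken from this pull request.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not the emitter code from this pull request. */
final class StatsDLineSketch {

    /** The three StatsD metric types discussed in this thread, with their wire suffixes. */
    enum MetricType {
        COUNTER("c"), TIMER("ms"), GAUGE("g");

        final String suffix;

        MetricType(String suffix) {
            this.suffix = suffix;
        }
    }

    /**
     * Builds one dotted bucket name from a Druid-style metric name and its
     * dimension values (assumed naming scheme: metric, then each value in order).
     */
    static String bucket(String metric, List<String> dimensionValues) {
        StringBuilder sb = new StringBuilder(metric.replace('/', '.'));
        for (String value : dimensionValues) {
            sb.append('.').append(value);
        }
        return sb.toString();
    }

    /** Formats a single StatsD datagram payload: <bucket>:<value>|<type>. */
    static String line(String bucket, long value, MetricType type) {
        return bucket + ":" + value + "|" + type.suffix;
    }

    public static void main(String[] args) {
        String b = bucket("query/time", Arrays.asList("historical", "wikipedia"));
        // Prints: query.time.historical.wikipedia:42|ms
        System.out.println(line(b, 42, MetricType.TIMER));
    }
}
```

Nothing is summed or averaged on the Druid side in this sketch: every event becomes one line, and whatever aggregation is wanted happens in the StatsD daemon that receives it.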

@michaelschiff michaelschiff force-pushed the statsd-emitter branch 7 times, most recently from d458bb5 to 31e1179 Compare February 8, 2016 17:13
@fjy fjy merged commit 2203a81 into apache:master Apr 29, 2016
@sascha-coenen

Hi all,
I meant to share that we took Michael's contribution and got it to run too. We tested it, made some nice-to-have modifications to it, and this extension is currently slated for a stage release and later a prod release in the coming weeks. This is to say that we too believe that this contribution is in a good state.

Things that we discussed /changed internally at our company:

  • as a general thing, there are emitter extensions for graphite, statsd, ganglia and talks about datadog. However, all these extensions are separate but very similar code. We wondered whether it might be desirable and feasible to instead have a single emitter extension based on the coda-hale/dropwizard framework which in turn has support for all the above-mentioned endpoints. In particular, we don't know the statsd library used in this extension and would have preferred the dropwizard lib, but I guess that's mostly personal preference.
  • one colleague expressed concern about the use of an unbounded queue for the store-and-forwarding of metrics and thought it safer to have a bounded queue, preferring loss of metrics over potential build-ups in memory. We didn't investigate this thoroughly enough at this point to have a definite answer on whether the current code is actually vulnerable to the scenario described.
  • we added in support for forwarding alerts to a different emitter in the same way as is the case with the graphite emitter.
  • with statsd, every metric needs to be mapped to a metric type (timer/gauge/counter). We are not sure whether this mapping is correct. If a metric that is not meant to be a counter is declared as such, the magnitude of the values ending up in dashboards will depend on the emission period configured in Druid. Verifying whether these mappings make sense is difficult because it is a lot of work and requires an exact understanding of the semantics of the metrics in Druid itself. It's an implementation detail of each metric emitted by Druid whether it's designed to work as a counter or gauge.
  • a statsd/graphite pipeline splits the metric path into folders any time it encounters a dot or colon character. The host names of druid nodes therefore get split up into subdomain, domain and port path elements, which is undesirable, so we replaced those tokens with underscores within the host dimension.
  • the class StatsD-Emitter writes an error-level log entry reporting a missing metric-type mapping. This is misleading, as the code path of that log message is hit for all events that are missing a configuration altogether, which would be the case if someone wants to purposefully exclude a metric from being emitted. Therefore we moved this message into the default clause of the switch statement and lowered the log level to "debug". In place of the existing log statement, we have another debug-level log event reporting that a given metric is missing from the configuration and is being swallowed.
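
The three code-level changes in the list above (a bounded queue, sanitized host names, and debug-level logging for unmapped metrics) could be sketched roughly as follows. This is an illustration under assumptions, not the code from this pull request or from the modified version described above: the class `EmitterTweaksSketch`, its methods and the queue of pre-formatted lines are all hypothetical, and `java.util.logging` stands in for whatever logger the extension actually uses.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.logging.Logger;

/** Hypothetical sketch of the tweaks described above; all names are assumptions. */
final class EmitterTweaksSketch {
    enum MetricType { COUNTER, TIMER, GAUGE }

    private static final Logger LOG = Logger.getLogger(EmitterTweaksSketch.class.getName());

    private final Map<String, MetricType> metricTypes = new HashMap<>();
    // Bounded queue: when it is full, new lines are dropped instead of piling up in memory.
    private final BlockingQueue<String> pending;

    EmitterTweaksSketch(int capacity) {
        this.pending = new ArrayBlockingQueue<>(capacity);
    }

    void register(String metric, MetricType type) {
        metricTypes.put(metric, type);
    }

    /** Replaces the characters that a statsd/graphite pipeline treats as path separators. */
    static String sanitizeHost(String host) {
        return host.replace('.', '_').replace(':', '_');
    }

    /** Returns true if the metric was queued, false if it was swallowed or dropped. */
    boolean emit(String host, String metric, long value) {
        MetricType type = metricTypes.get(metric);
        if (type == null) {
            // No configuration for this metric: treated as a deliberate exclusion, so debug level.
            LOG.fine("Metric " + metric + " is not configured and is being swallowed");
            return false;
        }
        String suffix;
        switch (type) {
            case COUNTER: suffix = "c"; break;
            case TIMER:   suffix = "ms"; break;
            case GAUGE:   suffix = "g"; break;
            default:
                // Only reached if a type is configured that this switch does not handle.
                LOG.fine("No handler for metric type " + type + " of metric " + metric);
                return false;
        }
        String line = sanitizeHost(host) + "." + metric.replace('/', '.') + ":" + value + "|" + suffix;
        // offer() never blocks; it returns false when the queue is at capacity.
        return pending.offer(line);
    }

    int queued() {
        return pending.size();
    }
}
```

With a capacity of 1, a second `emit` for a configured metric returns `false` rather than growing the queue, which is the "lose metrics rather than build up memory" trade-off. Whether the queue used by the actual StatsD client library can be bounded this way was not verified here.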

Somewhat off-topic for the scope of this pull request but related to our testing:

  • we discovered that there are metrics emitted by Druid which are not part of the documentation. Some such metrics pertain to name-space lookups, others to system-level TCP metrics.
  • how Druid handles metric emission internally is somewhat unclear to us. So far we have focused on getting metrics for historicals, brokers, cache, JVM and system working, and we see them pass through the statsd-emitter. Metrics that are not associated with a Monitor in Druid are missing, and therefore at this point we will only be able to focus on improving the metrics mapping for those metrics that are covered by monitors.
  • the extensions are part of the Druid code-base but are not being bundled into deployable units, so every user will have to build their custom druid distribution and take care of bundling in the dependent libraries. At least this is our current understanding.
  • when using the sigar library to get the system-level metrics in, we could so far only get these system-level metrics emitted in a local setup. We also have a docker-based druid cluster in which the system metric emission fails due to an exception in sigar. We haven't had the time to look into this and will also exclude this batch of metrics from further scrutiny, as our production setup will use collectd for most system metrics. But it's somewhat sad that so many metrics fall away.
  • we are missing a good metric for depicting how many sequential scans a query takes.
    Sorry for the long post and the off-topic stuff, but I included it on purpose to describe what we struggled with and to state which parts of the contribution we cannot help test.

@sascha-coenen

one more: it would be nice to have a counter metric for the alert-typed metrics. It would be possible to build this into the statsd-emitter or one could build it into Druid.

@michaelschiff

michaelschiff commented Apr 29, 2016

@sascha-coenen I totally agree!

Point 1 - We use codahale metrics everywhere in our company's applications, and it is a really nice library. In terms of druid, it would cover both the metric collection in the service, and then expose the Reporter interface which we might still end up implementing for each store (if they don't exist already). I could imagine this being a similar amount of code to a new Emitter. It would also kind of require rethinking how metrics are actually named (i.e. MetricName + combination of dimensions). You definitely don't want to end up storing a Metric in memory for each dimension combination....

Point 4 - I definitely felt this as I was defining the mapping. I did my best to make sure that every type is correct for the metric, but I may have made mistakes. While working on this, it occurred to me that this type information could be part of the Druid metric interface...

Point 6 - Makes sense, that's more convenient.

gianm added a commit to gianm/druid that referenced this pull request May 4, 2016
xvrl pushed a commit that referenced this pull request May 4, 2016
fjy pushed a commit that referenced this pull request Mar 27, 2017
This was mentioned in the original pull (#2410) by @sascha-coenen, and the original author (@michaelschiff) agreed that it seemed reasonable

This commit fixes issue #3960
seoeun25 pushed a commit to seoeun25/incubator-druid that referenced this pull request Jan 10, 2020