Add Emitter monitoring #4973

leventov · 2017-10-17T22:54:21Z

Updated to com.metamx:java-util:1.3.2, that includes everything that was formerly scattered among java-util, http-client, emitter and server-metrics libraries. Also, this update fixes this defect: Uncontrolled build up of buffers in HttpPostEmitter metamx/java-util#65
Adds monitoring to HttpPostEmitter and ParametrizedUriEmitter (depends on what is configured), with the following metrics:
- emitter/buffers/totalAllocated -- gauge, normally the sum of it since a Druid instance startup should be < 5, but if this number grows indefinitely, something is wrong with your network to the emitting endpoint, likely.
- emitter/buffers/emitQueue -- gauge, normally should stay close to zero, 1-2.
- emitter/buffers/failed -- gauge, normally should be 0.
- emitter/buffers/dropped - delta, normally should be 0.
- emitter/buffers/reuseQueue -- gauge, normally should be < 5.
- emitter/events/emitted -- delta, number of events emitted in the given period.
- emitter/events/emitQueue -- gauge, related to emitter/buffers/emitQueue
- emitter/events/large/emitQueue -- gauge, related to emitter/buffers/emitQueue

emitter/batchFilling/timeMsSum - delta, total time spend in filling emitter batches, in milliseconds.
emitter/batchFilling/count - delta, number of batches filled. timeMsSum / count gives "average batch filling time".
emitter/batchFilling/maxTimeMs - maximum time spend in filling one batch, since the last emitted.
emitter/batchFilling/minTimeMs - minimum time spend in filling one batch, since the last emitted.

And 8 more metrics, similar to the 4 metrics above, with prefixes emitter/successfulSending/ and emitter/failedSending/. Which mean the successful and failed batch sending over HTTP to the emitter endpoint, respectively.

The metrics of emitter/successfulSending/ could help to determine the good value for minHttpTimeoutMillis config, which is added to http and parametrized emitters.

xanec · 2017-10-19T20:53:52Z

server/src/main/java/io/druid/server/emitter/EmitterMonitorProvider.java

+@LazySingleton
+public class EmitterMonitorProvider
+{
+  private Monitor emitterMontor;


I think there is a missing "I"? Should be "emitterMonitor" instead of "emitterMontor"

Thanks, fixed

xanec · 2017-10-20T00:05:14Z

server/src/main/java/io/druid/server/emitter/HttpEmitterModule.java

+        emitter,
+        ImmutableMap.of()
+    );
+    emitterMonitorProvider.setEmitterMontor(emitterMonitor);


Just an opinion: should we inject the emitter monitor in rather than creating it in a singleton provider?

I tried to do this, but didn't find a way. Emitter is not injected statically, it's "chosen" in runtime. Simple injection leads to "duplicate injection" errors. If monitors are also provided via @Provider methods along with emitters, emitters still should dynamically "decide" which monitor to use. But I don't see how this approach is better than the current solution.

Yeah, I was thinking about the @Provider way.

Anyway, my concern is on the order in which the dependencies are injected, so would there be a situation where MetricsModule gets loaded before the emitter Modules such that emitterMonitorProvider returns null because it is not set yet? If this is not an issue, then I guess your method would simpler and more straightforward.

This is not possible, because in the method where EmitterMonitorProvider is injected, ServiceEmitter is injected as a parameter as well. ServiceEmitter depends on concrete Emitter. So there is a direct dependency chain and Guice will call one of Emitter providers first.

Got it 👍

gianm · 2017-10-20T06:37:44Z

emitter/buffers/allocated seems like it would be easier to monitor for problems as a gauge rather than delta metric. (You could check if its avg value is too high)

gianm · 2017-10-20T06:38:53Z

Btw, do you suspect bugs in the emitter that caused you to add these metrics?

How do these metrics get emitted if the emitter is not working right?

gianm · 2017-10-20T06:39:37Z

server/src/main/java/io/druid/server/metrics/MetricsModule.java

@@ -96,6 +98,10 @@ public MonitorScheduler getMonitorScheduler(

      monitors.add(monitor);
    }
+    Monitor emitterMontor = emitterMonitorProvider.getEmitterMonitor();


emitterMonitor (spelling)

gianm · 2017-10-20T06:40:43Z

server/src/main/java/io/druid/server/emitter/ParametrizedUriEmitterModule.java

        config.get(),
        HttpClientInit.createClient(builder.build(), LifecycleUtils.asMmxLifecycle(lifecycle)),
        jsonMapper
    );
+    ParametrizedUriEmitterMonitor emitterMonitor = new ParametrizedUriEmitterMonitor(


I think "parameterized" is the proper spelling. But the class is from another package so I guess we can't change it right now.

Yes, It was called that early on not by me, I preserved. Dictionary says that this is acceptable spelling also.

gianm · 2017-10-20T06:44:45Z

server/src/main/java/io/druid/server/emitter/ParametrizedUriEmitterModule.java

+        FeedDefiningMonitor.DEFAULT_METRICS_FEED,
+        emitter
+    );
+    emitterMonitorProvider.setEmitterMonitor(emitterMonitor);


Is there a guarantee that this will be called before MetricsModule's getMonitorScheduler is called to provide a MonitorScheduler? I guess it's just because that uses a ServiceEmitter, which needs an Emitter, which must call this method. If so, it's sort of "indirect" and seems like something that is worth a comment.

Added comment

leventov · 2017-10-20T18:09:11Z

@gianm

emitter/buffers/allocated seems like it would be easier to monitor for problems as a gauge rather than delta metric. (You could check if its avg value is too high)

The workflow is that if a buffer is failed to be emitted, it goes to a separate "fail queue" (reflected by emitter/buffers/failed), and if it is successfully emitted from there, it doesn't return to the reuseQueue. So the number of allocated buffers could grow indefinitely. I would keep it a delta metric, emitter/buffers/failed above zero could also allow to spot problem.

Also, I fixed description of emitter/buffers/reuseQueue, it was not correct.

Btw, do you suspect bugs in the emitter that caused you to add these metrics?

I would say not bugs, but the way the code is written, may lead to uncontrolled growth of the amount of buffers, if there are network problems with emitting. We are exploring that at the moment.

How do these metrics get emitted if the emitter is not working right?

Thanks, I didn't think about this before. At least HttpPostEmitter logs something when it couldn't emit a buffer, so it could be spotted in logs. Also in case of network problems, emitting old events doesn't block the new events, so if problems are gone, we could see new events. But in this case the information about problems could be lost, if we don't have gauge emitter/buffers/allocated metric. So I think I'm going to change it's to gauge indeed.

gianm · 2017-10-20T20:09:56Z

Thanks for the additional comments @leventov. I can review again when this is no longer WIP.

leventov · 2017-10-28T02:51:01Z

Adding to 0.11.0 milestone because it's rather critical to fix metamx/java-util#65, that is in java-util:1.1

jihoonson · 2017-10-28T02:55:56Z

Looks that the blocker is the bug you mentioned not the Emitter monitoring feature. Suggest to split this PR into two PRs which are for upgrading java-util and adding Emitter monitoring.

leventov · 2017-10-28T02:57:39Z

@jihoonson well, this PR would look like it only adds monitoring, but it automatically would fix emitter bug as well, because those changes are in the same java-util version

leventov · 2017-10-28T02:59:19Z

Making PR smaller for backport by just bumping java-util version is not possible, because it breaks backward compatibility of HttpPostEmitter as well, so need the code changes

jihoonson · 2017-10-28T03:02:21Z

@leventov ah ok.

gianm · 2017-10-30T22:23:25Z

@leventov is this still WIP or is it ready to review?

leventov · 2017-10-30T22:27:37Z

@gianm still WIP, we just found new issues with java-util

* testing new emitter * fix on failed test

drcrallen · 2017-10-31T23:28:03Z

server/src/main/java/io/druid/server/emitter/EmitterMonitorProvider.java

+import javax.annotation.Nullable;
+
+@LazySingleton
+public class EmitterMonitorProvider


Is this needed? Can a Monitor simply be added to druid.monitoring.monitors instead?

so basically my suggestion is adding in some monitors that can be used in druid.monitoring.monitors for the emitters in question, rather than having a special monitor case in the workflow.

See #4973 (comment). Also I think that more complicated code is better than requirement for extra config. I think that minimizing the amount of config and "good by default" behaviour is the top priority.

I don't think most users would be able to act on these metrics in any meaningful way. Unless there's some kind of action people can take, then it is a set of metrics mostly used for debugging purposes or monitoring by advanced setups. Plus there's a lot of monitors that are very handy for standard use that we don't include by default.

There's also the consideration of code maintainability. Having special cases for different kinds of monitors in https://github.com/druid-io/druid/pull/4973/files#diff-cba74f4106709a96941c9a365e22eddcR104 is against the spirit of what that class is trying to do. The monitoring approach here also has a reverse-injection kind of approach where the thing which is used to inject is injected so the injection can back propagate to where it is used, which is a bit of a tangle for maintainability.

A cleaner impl would have a monitor which checks the injected emitter to see if it can appropriately monitor it, and if not just ignore any monitoring (maybe with a warning).

@drcrallen there will be a problem, that such emitter will be able to support only emitters, included into the druid codebase, namely HttpPostEmitter and ParametrizedUriEmitter, but not emitters, defined in private extensions. Do you think it's ok?

@drcrallen there will be a problem, that such emitter will be able to support only emitters, included into the druid codebase, namely HttpPostEmitter and ParametrizedUriEmitter, but not emitters, defined in private extensions. Do you think it's ok?

Are you asking if it's ok if a builtin emitter-monitor only works on builtin emitters? Just making sure I understand the question.

If so, I think that's fine, as long as extensions defining their own emitters could also define their own emitter-monitors. (I think this should be possible already, even before this patch, but maybe I'm missing something)

drcrallen · 2017-11-02T18:49:53Z

pom.xml

-                <artifactId>server-metrics</artifactId>
-                <version>0.5.2</version>
+                <artifactId>java-util</artifactId>
+                <version>1.3.0</version>


is this still the best version?

leventov · 2017-11-03T14:12:58Z

@gianm do you have more comments?

gianm · 2017-11-04T00:05:45Z

server/src/main/java/io/druid/server/emitter/HttpEmitterModule.java

    }
-    return new HttpPostEmitter(config.get(), HttpClientInit.createClient(builder.build(), LifecycleUtils.asMmxLifecycle(lifecycle)), jsonMapper);
+    final AsyncHttpClient client = new DefaultAsyncHttpClient(builder.build());


The emitter doesn't use the metamx http-client anymore? Out of curiosity why is this?

gianm · 2017-11-04T03:27:15Z

server/src/main/java/io/druid/server/emitter/HttpEmittingMonitor.java

+          (ParametrizedUriEmitter) emitter
+      );
+    } else {
+      throw new IllegalStateException(


As a follow up, I'd suggest detecting ComposingEmitter and unpacking that to monitor the emitters within. And probably in that case, we'll want this to be a warning rather than an error if there is at least one emitter that can be monitored.

Does not need to block this PR.

would a ComposingEmitterMonitor be a better place for such functionality?

* Add Emitter monitoring * Fix typo * Fixes * testing new emitter * Fix failed test (#71) * testing new emitter * fix on failed test * Remove emitter's readTimeout from docs * Update docs * Add HttpEmittingMonitor * Update java-util to 1.3.2

himanshug · 2017-11-08T16:25:52Z

I'm not really sure but it appears that after this PR, sometimes peon process jvm fails to exit due to possibly following non-daemon thread

"pool-39-thread-1" #193 prio=5 os_prio=0 tid=0x00002b178d2b9800 nid=0x1b70 waiting on condition [0x00002b15fd907000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at io.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:560)
	at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:459)
	at java.lang.Thread.run(Thread.java:748)

did we add a new non-daemon thread pool somewhere in java-util or http client etc ?

himanshug · 2017-11-08T21:01:33Z

also I have seen errors like

Exception in thread "main" com.google.inject.CreationException: Unable to create injector, see the following errors:

1) Error in custom provider, java.lang.NoSuchMethodError: io.netty.util.internal.PlatformDependent.hasDirectBufferNoCleanerConstructor()Z
  at io.druid.server.emitter.HttpEmitterModule.getEmitter(HttpEmitterModule.java:81)

again not sure if really related to this PR

leventov · 2017-11-08T21:05:28Z

@himanshug which version of netty do you use?

pjain1 · 2017-11-08T21:05:51Z

Seen this as well, again not sure if its related

Exception in thread "main" com.google.inject.CreationException: Unable to create injector, see the following errors:

1) Error in custom provider, java.lang.AbstractMethodError: io.netty.util.concurrent.MultithreadEventExecutorGroup.newChild(Ljava/util/concurrent/ThreadFactory
;[Ljava/lang/Object;)Lio/netty/util/concurrent/EventExecutor;
  at io.druid.server.emitter.HttpEmitterModule.getEmitter(HttpEmitterModule.java:81) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.
inject.util.Modules$OverrideModule -> io.druid.server.emitter.EmitterModule -> io.druid.server.emitter.HttpEmitterModule)
  at io.druid.server.emitter.HttpEmitterModule.getEmitter(HttpEmitterModule.java:81) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.server.emitter.EmitterModule -> io.druid.server.emitter.HttpEmitterModule)
  while locating com.metamx.emitter.core.Emitter annotated with @com.google.inject.name.Named(value=http)
  at io.druid.server.emitter.EmitterModule.getServiceEmitter(EmitterModule.java:102) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.server.emitter.EmitterModule)
  at io.druid.server.emitter.EmitterModule.getServiceEmitter(EmitterModule.java:102) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.server.emitter.EmitterModule)
  while locating com.metamx.emitter.service.ServiceEmitter
    for the 4th parameter of io.druid.server.audit.SQLAuditManagerProvider.<init>(SQLAuditManagerProvider.java:50)
  while locating io.druid.server.audit.SQLAuditManagerProvider
  at io.druid.guice.SQLMetadataStorageDruidModule.configure(SQLMetadataStorageDruidModule.java:147) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.metadata.storage.derby.DerbyMetadataStorageDruidModule)
  while locating io.druid.server.audit.AuditManagerProvider annotated with @com.google.inject.multibindings.Element(setName=,uniqueId=45, type=MAPBINDER, keyType=java.lang.String)
  at io.druid.guice.PolyBind.createChoiceWithDefault(PolyBind.java:108) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.metadata.storage.derby.DerbyMetadataStorageDruidModule)
  while locating io.druid.server.audit.AuditManagerProvider
  at io.druid.cli.CliOverlord$1.configure(CliOverlord.java:186) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.cli.CliOverlord$1)
  while locating io.druid.audit.AuditManager
    for the 3rd parameter of io.druid.common.config.JacksonConfigManager.<init>(JacksonConfigManager.java:48)
  at io.druid.guice.JacksonConfigManagerModule.configure(JacksonConfigManagerModule.java:41) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.guice.JacksonConfigManagerModule)
  while locating io.druid.common.config.JacksonConfigManager
    for the 1st parameter of io.druid.guice.JacksonConfigProvider.configure(JacksonConfigProvider.java:80)
  at io.druid.guice.JacksonConfigProvider.bind(JacksonConfigProvider.java:38) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.cli.CliOverlord$1)

leventov · 2017-11-08T21:20:13Z

#5059

himanshug · 2017-11-08T21:31:44Z

we're getting following netty related jars bundled right now from druid master...

-rwxr-xr-x  2 root wheel  1292696 Nov  7 22:48 netty-3.10.6.Final.jar
-rwxr-xr-x  2 root wheel  3659852 Nov  7 22:48 netty-all-4.1.11.Final.jar
-rwxr-xr-x  2 root wheel   234592 Nov  7 22:48 netty-buffer-4.0.52.Final.jar
-rwxr-xr-x  2 root wheel   151008 Nov  7 22:48 netty-codec-4.0.52.Final.jar
-rwxr-xr-x  2 root wheel    55155 Nov  7 22:48 netty-codec-dns-2.0.37.jar
-rwxr-xr-x  2 root wheel   460925 Nov  7 22:48 netty-codec-http-4.0.52.Final.jar
-rwxr-xr-x  2 root wheel   398158 Nov  7 22:48 netty-common-4.0.52.Final.jar
-rwxr-xr-x  2 root wheel   358962 Nov  7 22:48 netty-handler-4.0.52.Final.jar
-rwxr-xr-x  2 root wheel    20938 Nov  7 22:48 netty-reactive-streams-1.0.8.jar
-rwxr-xr-x  2 root wheel    36585 Nov  7 22:48 netty-resolver-2.0.37.jar
-rwxr-xr-x  2 root wheel    89181 Nov  7 22:48 netty-resolver-dns-2.0.37.jar
-rwxr-xr-x  2 root wheel   387718 Nov  7 22:48 netty-transport-4.0.52.Final.jar
-rwxr-xr-x  2 root wheel   151894 Nov  7 22:48 netty-transport-native-epoll-4.0.52.Final-linux-x86_64.jar

@gianm

…#6425) @gianm gauge `emitter/buffers/failed` and `emitter/buffers/totalAllocated` metrics proved to be very inconvenient in practice: they are not additive across a fleet of nodes, are randomly reset when a JVM restarts (that may be completely independent from metrics emitting issues). Regarding [problem identification](#4973 (comment)), `emitter/buffers/dropped` serves this well. Currently in this PR I added `/delta` vs. `/total` metrics, but if there are no objections I would remove `/total` and so align `emitter/buffers/allocated` and `/failed` with `/dropped` and `/events/emitted` which are already delta-only.

Add Emitter monitoring

fdbd085

leventov added Area - Metrics/Event Emitting Area - Operations labels Oct 17, 2017

Merge remote-tracking branch 'upstream/master' into emitter-monitoring

6c908dc

xanec suggested changes Oct 20, 2017

View reviewed changes

Fix typo

f72c5e2

xanec approved these changes Oct 20, 2017

View reviewed changes

gianm reviewed Oct 20, 2017

View reviewed changes

Fixes

0f09b69

leventov added the WIP label Oct 20, 2017

leventov mentioned this pull request Oct 20, 2017

Uncontrolled build up of buffers in HttpPostEmitter metamx/java-util#65

Closed

leventov mentioned this pull request Oct 28, 2017

Fix improper handling of empty arrays in StringDimensionIndexer. #5012

Merged

leventov added this to the 0.11.0 milestone Oct 28, 2017

leventov and others added 3 commits October 31, 2017 17:55

testing new emitter

798e9d8

Fix failed test (#71)

208697c

* testing new emitter * fix on failed test

Remove emitter's readTimeout from docs

7f39154

leventov removed the WIP label Oct 31, 2017

Update docs

70d0a81

drcrallen reviewed Oct 31, 2017

View reviewed changes

Add HttpEmittingMonitor

de1cc0a

leventov added the WIP label Nov 1, 2017

drcrallen approved these changes Nov 2, 2017

View reviewed changes

drcrallen reviewed Nov 2, 2017

View reviewed changes

Update java-util to 1.3.2

a997bc2

leventov removed the WIP label Nov 2, 2017

gianm approved these changes Nov 4, 2017

View reviewed changes

gianm merged commit 5eb08c2 into apache:master Nov 4, 2017

leventov deleted the emitter-monitoring branch November 4, 2017 09:11

leventov added the Improvement label Nov 7, 2017

xanec mentioned this pull request Nov 8, 2017

Use daemon thread pool for AsyncHttpClient in emitters #5057

Merged

leventov mentioned this pull request Nov 8, 2017

Downgrade Netty version 4.1 -> 4.0 #5059

Merged

himanshug mentioned this pull request Nov 10, 2017

make AsyncHttpCLient instance timer thread daemon #5076

Merged

drcrallen mentioned this pull request Oct 4, 2018

Upgrade Netty to 4.1.x #6417

Merged

leventov mentioned this pull request Oct 5, 2018

Emit emitter/buffers/allocated/delta and emitter/buffers/failed/delta #6425

Merged

Add Emitter monitoring #4973

Add Emitter monitoring #4973

Conversation

leventov commented Oct 17, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gianm commented Oct 20, 2017

gianm commented Oct 20, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

leventov commented Oct 20, 2017

gianm commented Oct 20, 2017

leventov commented Oct 28, 2017

jihoonson commented Oct 28, 2017

leventov commented Oct 28, 2017

leventov commented Oct 28, 2017

jihoonson commented Oct 28, 2017

gianm commented Oct 30, 2017

leventov commented Oct 30, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gianm Nov 1, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

leventov commented Nov 3, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

himanshug commented Nov 8, 2017

himanshug commented Nov 8, 2017

leventov commented Nov 8, 2017

pjain1 commented Nov 8, 2017

leventov commented Nov 8, 2017

himanshug commented Nov 8, 2017

leventov commented Oct 17, 2017 •

edited

Loading

gianm Nov 1, 2017 •

edited

Loading