Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Emitter monitoring #4973

Merged
merged 10 commits into from
Nov 4, 2017
Merged

Add Emitter monitoring #4973

merged 10 commits into from
Nov 4, 2017

Conversation

leventov
Copy link
Member

@leventov leventov commented Oct 17, 2017

  • Updated to com.metamx:java-util:1.3.2, that includes everything that was formerly scattered among java-util, http-client, emitter and server-metrics libraries. Also, this update fixes this defect: Uncontrolled build up of buffers in HttpPostEmitter metamx/java-util#65
  • Adds monitoring to HttpPostEmitter and ParametrizedUriEmitter (depends on what is configured), with the following metrics:
    • emitter/buffers/totalAllocated -- gauge, normally the sum of it since a Druid instance startup should be < 5, but if this number grows indefinitely, something is wrong with your network to the emitting endpoint, likely.
    • emitter/buffers/emitQueue -- gauge, normally should stay close to zero, 1-2.
    • emitter/buffers/failed -- gauge, normally should be 0.
    • emitter/buffers/dropped - delta, normally should be 0.
    • emitter/buffers/reuseQueue -- gauge, normally should be < 5.
    • emitter/events/emitted -- delta, number of events emitted in the given period.
    • emitter/events/emitQueue -- gauge, related to emitter/buffers/emitQueue
    • emitter/events/large/emitQueue -- gauge, related to emitter/buffers/emitQueue

  • emitter/batchFilling/timeMsSum - delta, total time spend in filling emitter batches, in milliseconds.
  • emitter/batchFilling/count - delta, number of batches filled. timeMsSum / count gives "average batch filling time".
  • emitter/batchFilling/maxTimeMs - maximum time spend in filling one batch, since the last emitted.
  • emitter/batchFilling/minTimeMs - minimum time spend in filling one batch, since the last emitted.

And 8 more metrics, similar to the 4 metrics above, with prefixes emitter/successfulSending/ and emitter/failedSending/. Which mean the successful and failed batch sending over HTTP to the emitter endpoint, respectively.

The metrics of emitter/successfulSending/ could help to determine the good value for minHttpTimeoutMillis config, which is added to http and parametrized emitters.

@LazySingleton
public class EmitterMonitorProvider
{
private Monitor emitterMontor;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a missing "I"? Should be "emitterMonitor" instead of "emitterMontor"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed

emitter,
ImmutableMap.of()
);
emitterMonitorProvider.setEmitterMontor(emitterMonitor);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an opinion: should we inject the emitter monitor in rather than creating it in a singleton provider?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to do this, but didn't find a way. Emitter is not injected statically, it's "chosen" in runtime. Simple injection leads to "duplicate injection" errors. If monitors are also provided via @Provider methods along with emitters, emitters still should dynamically "decide" which monitor to use. But I don't see how this approach is better than the current solution.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking about the @Provider way.

Anyway, my concern is on the order in which the dependencies are injected, so would there be a situation where MetricsModule gets loaded before the emitter Modules such that emitterMonitorProvider returns null because it is not set yet? If this is not an issue, then I guess your method would simpler and more straightforward.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not possible, because in the method where EmitterMonitorProvider is injected, ServiceEmitter is injected as a parameter as well. ServiceEmitter depends on concrete Emitter. So there is a direct dependency chain and Guice will call one of Emitter providers first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it 👍

@gianm
Copy link
Contributor

gianm commented Oct 20, 2017

emitter/buffers/allocated seems like it would be easier to monitor for problems as a gauge rather than delta metric. (You could check if its avg value is too high)

@gianm
Copy link
Contributor

gianm commented Oct 20, 2017

Btw, do you suspect bugs in the emitter that caused you to add these metrics?

How do these metrics get emitted if the emitter is not working right?

@@ -96,6 +98,10 @@ public MonitorScheduler getMonitorScheduler(

monitors.add(monitor);
}
Monitor emitterMontor = emitterMonitorProvider.getEmitterMonitor();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emitterMonitor (spelling)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

config.get(),
HttpClientInit.createClient(builder.build(), LifecycleUtils.asMmxLifecycle(lifecycle)),
jsonMapper
);
ParametrizedUriEmitterMonitor emitterMonitor = new ParametrizedUriEmitterMonitor(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "parameterized" is the proper spelling. But the class is from another package so I guess we can't change it right now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, It was called that early on not by me, I preserved. Dictionary says that this is acceptable spelling also.

FeedDefiningMonitor.DEFAULT_METRICS_FEED,
emitter
);
emitterMonitorProvider.setEmitterMonitor(emitterMonitor);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a guarantee that this will be called before MetricsModule's getMonitorScheduler is called to provide a MonitorScheduler? I guess it's just because that uses a ServiceEmitter, which needs an Emitter, which must call this method. If so, it's sort of "indirect" and seems like something that is worth a comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment

@leventov
Copy link
Member Author

@gianm

emitter/buffers/allocated seems like it would be easier to monitor for problems as a gauge rather than delta metric. (You could check if its avg value is too high)

The workflow is that if a buffer is failed to be emitted, it goes to a separate "fail queue" (reflected by emitter/buffers/failed), and if it is successfully emitted from there, it doesn't return to the reuseQueue. So the number of allocated buffers could grow indefinitely. I would keep it a delta metric, emitter/buffers/failed above zero could also allow to spot problem.

Also, I fixed description of emitter/buffers/reuseQueue, it was not correct.

Btw, do you suspect bugs in the emitter that caused you to add these metrics?

I would say not bugs, but the way the code is written, may lead to uncontrolled growth of the amount of buffers, if there are network problems with emitting. We are exploring that at the moment.

How do these metrics get emitted if the emitter is not working right?

Thanks, I didn't think about this before. At least HttpPostEmitter logs something when it couldn't emit a buffer, so it could be spotted in logs. Also in case of network problems, emitting old events doesn't block the new events, so if problems are gone, we could see new events. But in this case the information about problems could be lost, if we don't have gauge emitter/buffers/allocated metric. So I think I'm going to change it's to gauge indeed.

@gianm
Copy link
Contributor

gianm commented Oct 20, 2017

Thanks for the additional comments @leventov. I can review again when this is no longer WIP.

@leventov
Copy link
Member Author

Adding to 0.11.0 milestone because it's rather critical to fix metamx/java-util#65, that is in java-util:1.1

@jihoonson
Copy link
Contributor

Looks that the blocker is the bug you mentioned not the Emitter monitoring feature. Suggest to split this PR into two PRs which are for upgrading java-util and adding Emitter monitoring.

@leventov
Copy link
Member Author

@jihoonson well, this PR would look like it only adds monitoring, but it automatically would fix emitter bug as well, because those changes are in the same java-util version

@leventov
Copy link
Member Author

Making PR smaller for backport by just bumping java-util version is not possible, because it breaks backward compatibility of HttpPostEmitter as well, so need the code changes

@jihoonson
Copy link
Contributor

@leventov ah ok.

@gianm
Copy link
Contributor

gianm commented Oct 30, 2017

@leventov is this still WIP or is it ready to review?

@leventov
Copy link
Member Author

@gianm still WIP, we just found new issues with java-util

@leventov leventov removed the WIP label Oct 31, 2017
import javax.annotation.Nullable;

@LazySingleton
public class EmitterMonitorProvider
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed? Can a Monitor simply be added to druid.monitoring.monitors instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so basically my suggestion is adding in some monitors that can be used in druid.monitoring.monitors for the emitters in question, rather than having a special monitor case in the workflow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #4973 (comment). Also I think that more complicated code is better than requirement for extra config. I think that minimizing the amount of config and "good by default" behaviour is the top priority.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think most users would be able to act on these metrics in any meaningful way. Unless there's some kind of action people can take, then it is a set of metrics mostly used for debugging purposes or monitoring by advanced setups. Plus there's a lot of monitors that are very handy for standard use that we don't include by default.

There's also the consideration of code maintainability. Having special cases for different kinds of monitors in https://github.com/druid-io/druid/pull/4973/files#diff-cba74f4106709a96941c9a365e22eddcR104 is against the spirit of what that class is trying to do. The monitoring approach here also has a reverse-injection kind of approach where the thing which is used to inject is injected so the injection can back propagate to where it is used, which is a bit of a tangle for maintainability.

A cleaner impl would have a monitor which checks the injected emitter to see if it can appropriately monitor it, and if not just ignore any monitoring (maybe with a warning).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drcrallen there will be a problem, that such emitter will be able to support only emitters, included into the druid codebase, namely HttpPostEmitter and ParametrizedUriEmitter, but not emitters, defined in private extensions. Do you think it's ok?

Copy link
Contributor

@gianm gianm Nov 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@drcrallen there will be a problem, that such emitter will be able to support only emitters, included into the druid codebase, namely HttpPostEmitter and ParametrizedUriEmitter, but not emitters, defined in private extensions. Do you think it's ok?

Are you asking if it's ok if a builtin emitter-monitor only works on builtin emitters? Just making sure I understand the question.

If so, I think that's fine, as long as extensions defining their own emitters could also define their own emitter-monitors. (I think this should be possible already, even before this patch, but maybe I'm missing something)

@leventov leventov added the WIP label Nov 1, 2017
pom.xml Outdated
<artifactId>server-metrics</artifactId>
<version>0.5.2</version>
<artifactId>java-util</artifactId>
<version>1.3.0</version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still the best version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@leventov leventov removed the WIP label Nov 2, 2017
@leventov
Copy link
Member Author

leventov commented Nov 3, 2017

@gianm do you have more comments?

}
return new HttpPostEmitter(config.get(), HttpClientInit.createClient(builder.build(), LifecycleUtils.asMmxLifecycle(lifecycle)), jsonMapper);
final AsyncHttpClient client = new DefaultAsyncHttpClient(builder.build());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The emitter doesn't use the metamx http-client anymore? Out of curiosity why is this?

(ParametrizedUriEmitter) emitter
);
} else {
throw new IllegalStateException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow up, I'd suggest detecting ComposingEmitter and unpacking that to monitor the emitters within. And probably in that case, we'll want this to be a warning rather than an error if there is at least one emitter that can be monitored.

Does not need to block this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would a ComposingEmitterMonitor be a better place for such functionality?

@gianm gianm merged commit 5eb08c2 into apache:master Nov 4, 2017
@leventov leventov deleted the emitter-monitoring branch November 4, 2017 09:11
leventov added a commit to metamx/druid that referenced this pull request Nov 7, 2017
* Add Emitter monitoring

* Fix typo

* Fixes

* testing new emitter

* Fix failed test (#71)

* testing new emitter

* fix on failed test

* Remove emitter's readTimeout from docs

* Update docs

* Add HttpEmittingMonitor

* Update java-util to 1.3.2
himanshug pushed a commit that referenced this pull request Nov 7, 2017
* Add Emitter monitoring

* Fix typo

* Fixes

* testing new emitter

* Fix failed test (#71)

* testing new emitter

* fix on failed test

* Remove emitter's readTimeout from docs

* Update docs

* Add HttpEmittingMonitor

* Update java-util to 1.3.2
@himanshug
Copy link
Contributor

I'm not really sure but it appears that after this PR, sometimes peon process jvm fails to exit due to possibly following non-daemon thread

"pool-39-thread-1" #193 prio=5 os_prio=0 tid=0x00002b178d2b9800 nid=0x1b70 waiting on condition [0x00002b15fd907000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at io.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:560)
	at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:459)
	at java.lang.Thread.run(Thread.java:748)

did we add a new non-daemon thread pool somewhere in java-util or http client etc ?

@himanshug
Copy link
Contributor

also I have seen errors like

Exception in thread "main" com.google.inject.CreationException: Unable to create injector, see the following errors:

1) Error in custom provider, java.lang.NoSuchMethodError: io.netty.util.internal.PlatformDependent.hasDirectBufferNoCleanerConstructor()Z
  at io.druid.server.emitter.HttpEmitterModule.getEmitter(HttpEmitterModule.java:81)

again not sure if really related to this PR

@leventov
Copy link
Member Author

leventov commented Nov 8, 2017

@himanshug which version of netty do you use?

@pjain1
Copy link
Member

pjain1 commented Nov 8, 2017

Seen this as well, again not sure if its related

Exception in thread "main" com.google.inject.CreationException: Unable to create injector, see the following errors:

1) Error in custom provider, java.lang.AbstractMethodError: io.netty.util.concurrent.MultithreadEventExecutorGroup.newChild(Ljava/util/concurrent/ThreadFactory
;[Ljava/lang/Object;)Lio/netty/util/concurrent/EventExecutor;
  at io.druid.server.emitter.HttpEmitterModule.getEmitter(HttpEmitterModule.java:81) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.
inject.util.Modules$OverrideModule -> io.druid.server.emitter.EmitterModule -> io.druid.server.emitter.HttpEmitterModule)
  at io.druid.server.emitter.HttpEmitterModule.getEmitter(HttpEmitterModule.java:81) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.server.emitter.EmitterModule -> io.druid.server.emitter.HttpEmitterModule)
  while locating com.metamx.emitter.core.Emitter annotated with @com.google.inject.name.Named(value=http)
  at io.druid.server.emitter.EmitterModule.getServiceEmitter(EmitterModule.java:102) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.server.emitter.EmitterModule)
  at io.druid.server.emitter.EmitterModule.getServiceEmitter(EmitterModule.java:102) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.server.emitter.EmitterModule)
  while locating com.metamx.emitter.service.ServiceEmitter
    for the 4th parameter of io.druid.server.audit.SQLAuditManagerProvider.<init>(SQLAuditManagerProvider.java:50)
  while locating io.druid.server.audit.SQLAuditManagerProvider
  at io.druid.guice.SQLMetadataStorageDruidModule.configure(SQLMetadataStorageDruidModule.java:147) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.metadata.storage.derby.DerbyMetadataStorageDruidModule)
  while locating io.druid.server.audit.AuditManagerProvider annotated with @com.google.inject.multibindings.Element(setName=,uniqueId=45, type=MAPBINDER, keyType=java.lang.String)
  at io.druid.guice.PolyBind.createChoiceWithDefault(PolyBind.java:108) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.metadata.storage.derby.DerbyMetadataStorageDruidModule)
  while locating io.druid.server.audit.AuditManagerProvider
  at io.druid.cli.CliOverlord$1.configure(CliOverlord.java:186) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.cli.CliOverlord$1)
  while locating io.druid.audit.AuditManager
    for the 3rd parameter of io.druid.common.config.JacksonConfigManager.<init>(JacksonConfigManager.java:48)
  at io.druid.guice.JacksonConfigManagerModule.configure(JacksonConfigManagerModule.java:41) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.guice.JacksonConfigManagerModule)
  while locating io.druid.common.config.JacksonConfigManager
    for the 1st parameter of io.druid.guice.JacksonConfigProvider.configure(JacksonConfigProvider.java:80)
  at io.druid.guice.JacksonConfigProvider.bind(JacksonConfigProvider.java:38) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> io.druid.cli.CliOverlord$1)

@leventov
Copy link
Member Author

leventov commented Nov 8, 2017

#5059

@himanshug
Copy link
Contributor

we're getting following netty related jars bundled right now from druid master...

-rwxr-xr-x  2 root wheel  1292696 Nov  7 22:48 netty-3.10.6.Final.jar
-rwxr-xr-x  2 root wheel  3659852 Nov  7 22:48 netty-all-4.1.11.Final.jar
-rwxr-xr-x  2 root wheel   234592 Nov  7 22:48 netty-buffer-4.0.52.Final.jar
-rwxr-xr-x  2 root wheel   151008 Nov  7 22:48 netty-codec-4.0.52.Final.jar
-rwxr-xr-x  2 root wheel    55155 Nov  7 22:48 netty-codec-dns-2.0.37.jar
-rwxr-xr-x  2 root wheel   460925 Nov  7 22:48 netty-codec-http-4.0.52.Final.jar
-rwxr-xr-x  2 root wheel   398158 Nov  7 22:48 netty-common-4.0.52.Final.jar
-rwxr-xr-x  2 root wheel   358962 Nov  7 22:48 netty-handler-4.0.52.Final.jar
-rwxr-xr-x  2 root wheel    20938 Nov  7 22:48 netty-reactive-streams-1.0.8.jar
-rwxr-xr-x  2 root wheel    36585 Nov  7 22:48 netty-resolver-2.0.37.jar
-rwxr-xr-x  2 root wheel    89181 Nov  7 22:48 netty-resolver-dns-2.0.37.jar
-rwxr-xr-x  2 root wheel   387718 Nov  7 22:48 netty-transport-4.0.52.Final.jar
-rwxr-xr-x  2 root wheel   151894 Nov  7 22:48 netty-transport-native-epoll-4.0.52.Final-linux-x86_64.jar

@drcrallen drcrallen mentioned this pull request Oct 4, 2018
leventov added a commit that referenced this pull request Oct 15, 2018
…#6425)

@gianm gauge `emitter/buffers/failed` and `emitter/buffers/totalAllocated` metrics proved to be very inconvenient in practice: they are not additive across a fleet of nodes, are randomly reset when a JVM restarts (that may be completely independent from metrics emitting issues). Regarding [problem identification](#4973 (comment)), `emitter/buffers/dropped` serves this well.

Currently in this PR I added `/delta` vs. `/total` metrics, but if there are no objections I would remove `/total` and so align `emitter/buffers/allocated` and `/failed` with `/dropped` and `/events/emitted` which are already delta-only.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants