Add ServiceStatusMonitor to monitor service health #14443

YongGang · 2023-06-16T20:43:19Z

Description

In some cases no leader is elected for the Overlord/Coordinator, and in others multiple leaders are elected. This can happen when a service is overloaded or a network partition occurs.

This PR adds a general heartbeat metric to indicate service health. For the Overlord/Coordinator, the sum of the druid/heartbeat metric with the leader=1 dimension should always be one in a healthy Druid cluster.
The new metric example for Coordinator:
{"feed":"metrics","leader":1,"metric":"druid/heartbeat","service":"coordinator","host":"localhost:8081","version":"","value":1,"timestamp":"2023-06-21T21:55:53.216Z"}
For Overlord:
{"feed":"metrics","leader":1,"metric":"druid/heartbeat","service":"overlord","host":"localhost:8090","version":"","value":1,"timestamp":"2023-06-21T21:34:01.764Z"}
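
As an illustration of the "sum should be one" rule, a downstream consumer could check the events from a single emission period roughly like this. It is only a sketch and not part of this PR: `HeartbeatEvent`, `LeaderHeartbeatCheck`, and their methods are hypothetical names, and the events are assumed to be already parsed from the JSON shown above.

```java
import java.util.List;

// Hypothetical, simplified view of one parsed druid/heartbeat event.
final class HeartbeatEvent {
  final String service; // e.g. "coordinator" or "overlord"
  final int leader;     // value of the "leader" dimension
  final int value;      // metric value, 1 for every heartbeat

  HeartbeatEvent(String service, int leader, int value) {
    this.service = service;
    this.leader = leader;
    this.value = value;
  }
}

final class LeaderHeartbeatCheck {
  // Sums the heartbeat values of one service, counting only events with leader=1.
  static int leaderHeartbeatSum(List<HeartbeatEvent> events, String service) {
    int sum = 0;
    for (HeartbeatEvent e : events) {
      if (e.service.equals(service) && e.leader == 1) {
        sum += e.value;
      }
    }
    return sum;
  }

  // A sum of 0 suggests no leader; 2 or more suggests multiple leaders.
  static boolean hasExactlyOneLeader(List<HeartbeatEvent> events, String service) {
    return leaderHeartbeatSum(events, service) == 1;
  }
}
```

In practice this aggregation would more likely live in the metrics backend as an alert rule; the Java form is only meant to make the rule concrete.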

Other services need to provide the following supplier to report the druid/heartbeat metric:

@Named("heartbeat")
Supplier<Map<String, Object>> heartbeatTagsSupplier
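
For illustration, a supplier for a leader-elected service might look roughly like the sketch below. It is dependency-free on purpose: it uses `java.util.function.Supplier` and omits the Guice `@Named("heartbeat")` wiring, whereas the injector error later in this thread shows the bound type is Guava's `Supplier`. `HeartbeatTags` and `forLeaderElectedService` are hypothetical names, and reporting `leader` as 0 on a non-leader is an assumption.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

final class HeartbeatTags {
  // Builds a supplier that is re-evaluated on every call, so the "leader"
  // dimension reflects the current leadership state, not the state at startup.
  static Supplier<Map<String, Object>> forLeaderElectedService(BooleanSupplier isLeader) {
    return () -> {
      Map<String, Object> tags = new HashMap<>();
      // Assumption: 1 when this process is the leader, 0 otherwise.
      tags.put("leader", isLeader.getAsBoolean() ? 1 : 0);
      return Collections.unmodifiableMap(tags);
    };
  }
}
```

Evaluating the tags lazily matters here because leadership can change between two monitor invocations.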

Release note

Add a new Monitor to monitor the health of the Overlord and Coordinator services.

Key changed/added classes in this PR

add ServiceStatusMonitor

suneet-s · 2023-06-16T21:29:36Z

Thanks for your first contribution to Druid @YongGang !

Instead of introducing one monitor per service, I'd recommend creating a single monitor that can be installed on every service. This will make it easier to use as an operator can set this monitor in the common runtime properties instead of having to configure it per service.

Instead of the leader/count metric, what do you think of introducing a druid/heartbeat metric that can be annotated with dimensions like leader for the coordinator / overlord, or task_id for the peons, or disabled for middle managers?

We do not need to implement all these ideas in this PR, but I think a heartbeat metric will be more flexible than a metric that is scoped to the leader.

suneet-s · 2023-06-16T22:07:24Z

...xing-service/src/main/java/org/apache/druid/indexing/common/stats/OverlordStatusMonitor.java

+  public boolean doMonitor(ServiceEmitter emitter) {
+    final ServiceMetricEvent.Builder builder = new ServiceMetricEvent.Builder();
+
+    builder.setDimension("serviceType", "overlord");


Since the monitor is using the ServiceEmitter, the service and host will be added auto-magically :) So this is not needed.

YongGang · 2023-06-20T05:15:32Z

Thanks for the review @suneet-s .
This is a good idea; I updated the PR to make the metric more general.
For Peon and MM, I haven't found a good way to report health status that aligns with what we do here, though.

suneet-s · 2023-06-20T19:16:03Z

For Peon and MM, I haven't found a good way to report health status that aligns with what we do here, though.

My recommendation would be to remove the ServiceStatusProvider that is currently implemented and replace it with a named, injected @Named("heartbeat") Supplier<Map<String, Object>> heartbeatTags. Then CliCoordinator and other services can inject the Supplier based on what each service deems a useful annotation.

YongGang · 2023-06-21T05:26:11Z

Updated to use the Supplier pattern; tested locally and saw the correct output.

kfaraz · 2023-06-21T07:10:24Z

docs/configuration/index.md

-|`org.apache.druid.server.metrics.TaskCountStatsMonitor`|Reports how many ingestion tasks are currently running/pending/waiting and also the number of successful/failed tasks per emission period.|
-|`org.apache.druid.server.metrics.TaskSlotCountStatsMonitor`|Reports metrics about task slot usage per emission period.|
-|`org.apache.druid.server.metrics.WorkerTaskCountStatsMonitor`|Reports how many ingestion tasks are currently running/pending/waiting, the number of successful/failed tasks, and metrics about task slot usage for the reporting worker, per emission period. Only supported by middleManager node types.|
+| Name                                                           | Description                                                                                                                                                                                                                                 |


I think the original formatting aligns better with the rest of the Druid docs.

kfaraz · 2023-06-21T07:10:55Z

docs/operations/metrics.md

@@ -326,6 +326,12 @@ If `emitBalancingStats` is set to `true` in the Coordinator [dynamic configurati

 ## General Health

+### Overlord/Coordinator
+
+| Metric         | Description                                                                                                   | Dimensions      | Normal Value |


Please use the formatting style used in the rest of the Druid docs.

Updated. IntelliJ keeps reformatting the doc, interesting.

kfaraz · 2023-06-21T07:17:52Z

server/src/main/java/org/apache/druid/server/metrics/ServiceStatusMonitor.java

+      return true;
+    }
+
+    heartbeatTagsSupplier.get().forEach((k, v) -> {


If I am not mistaken, the tags should each be a separate dimension and not be emitted as metric values.
The metric value will always be 1, as it is a simple count.

In the current code, you would be emitting the druid/heartbeat metric multiple times in every invocation of doMonitor.

Not sure I understand.
For Overlord/Coordinator, this druid/heartbeat metric will be 1 with the heartbeatType dimension set to leader. Since this Map<String, Number> has only one entry, the metric is reported only once per doMonitor call.
For other potential services (or components within a service) that use this monitor, the druid/heartbeat metric doesn't have to be 1. That's why the heartbeatType dimension is introduced: the metric can have a different meaning for each heartbeatType.

I think what Kashif is mentioning here is that heartbeatTagsSupplier should be a map of dimension keys to values and the metric that is reported for the heartbeat is always a constant.

Thanks, updated the code with dimensions configured.

kfaraz · 2023-06-21T07:18:07Z

server/src/test/java/org/apache/druid/server/metrics/ServiceStatusMonitorTest.java

+    Assert.assertEquals("druid/heartbeat", emitter.getEvents().get(0).toMap().get("metric"));
+    Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("value"));
+  }
+}


Nit: newline at end of file.

kfaraz · 2023-06-21T07:19:30Z

server/src/main/java/org/apache/druid/server/metrics/ServiceStatusMonitor.java

+      return true;
+    }
+
+    heartbeatTagsSupplier.get().forEach((k, v) -> {


Please add a null check on the supplier, just in case someone wants to use this monitor for other services, where the tags supplier is not being injected.

To make the heartbeat work on other services, I think the pattern in this monitor should be

ServiceMetricEvent.Builder builder = ...;
if (!heartbeatDimensions.isEmpty()) {
  heartbeatDimensions.forEach((k, v) -> builder.setDimension(k, v));
}
emitter.emit(builder.build("druid/heartbeat", 1));
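
To make the suggested pattern concrete, here is a rough, self-contained sketch that combines it with the null check requested above. It deliberately avoids Druid classes: `SketchEvent` and `HeartbeatMonitorSketch` are hypothetical stand-ins for `ServiceMetricEvent` and the monitor, and a plain `List` stands in for the `ServiceEmitter`.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for an emitted metric event (not Druid's ServiceMetricEvent).
final class SketchEvent {
  final String metric;
  final Number value;
  final Map<String, Object> dimensions;

  SketchEvent(String metric, Number value, Map<String, Object> dimensions) {
    this.metric = metric;
    this.value = value;
    // Defensive copy so later changes to the tag map do not alter this event.
    this.dimensions = Collections.unmodifiableMap(new HashMap<>(dimensions));
  }
}

final class HeartbeatMonitorSketch {
  // May be null on services that do not provide any heartbeat tags.
  private final Supplier<Map<String, Object>> heartbeatTagsSupplier;

  HeartbeatMonitorSketch(Supplier<Map<String, Object>> heartbeatTagsSupplier) {
    this.heartbeatTagsSupplier = heartbeatTagsSupplier;
  }

  // Emits exactly one heartbeat per call: tags become dimensions, the value is always 1.
  boolean doMonitor(List<SketchEvent> emitted) {
    Map<String, Object> tags = heartbeatTagsSupplier == null ? null : heartbeatTagsSupplier.get();
    if (tags == null) {
      tags = Collections.emptyMap();
    }
    emitted.add(new SketchEvent("druid/heartbeat", 1, tags));
    return true;
  }
}
```

With this shape, a service without a supplier still emits a plain heartbeat, and a service with several tags emits one event carrying several dimensions rather than one event per tag.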

kfaraz · 2023-06-21T07:19:59Z

server/src/main/java/org/apache/druid/server/metrics/ServiceStatusMonitor.java

+  Supplier<Map<String, Number>> heartbeatTagsSupplier = null;
+
+  @Inject
+  public ServiceStatusMonitor() {


Is the empty constructor needed?

kfaraz · 2023-06-21T07:21:11Z

docs/operations/metrics.md

@@ -326,6 +326,12 @@ If `emitBalancingStats` is set to `true` in the Coordinator [dynamic configurati

 ## General Health

+### Overlord/Coordinator


I think it can be used for all services, even though it might not be very useful right now except for coordinator and overlord.

suneet-s

I think we're closer now.

Can you please update the description of the PR to reflect the new design and include examples of what the metrics look like once the feedback is incorporated.

suneet-s · 2023-06-21T19:08:55Z

server/src/main/java/org/apache/druid/server/metrics/ServiceStatusMonitor.java

+      return true;
+    }
+
+    heartbeatTagsSupplier.get().forEach((k, v) -> {


I think what Kashif is mentioning here is that heartbeatTagsSupplier should be a map of dimension keys to values and the metric that is reported for the heartbeat is always a constant.

suneet-s · 2023-06-21T19:12:25Z

server/src/main/java/org/apache/druid/server/metrics/ServiceStatusMonitor.java

+      return true;
+    }
+
+    heartbeatTagsSupplier.get().forEach((k, v) -> {


To make the heartbeat work on other services, I think the pattern in this monitor should be

ServiceMetricEvent.Builder builder = ...;
if (!heartbeatDimensions.isEmpty()) {
  heartbeatDimensions.forEach((k, v) -> builder.setDimension(k, v));
}
emitter.emit(builder.build("druid/heartbeat", 1));

YongGang · 2023-06-21T22:28:32Z

Updated the PR description to reflect the new design.

suneet-s

Looks good once the docs + tests are updated!

suneet-s · 2023-06-22T00:05:22Z

docs/operations/metrics.md

+
+|Metric|Description|Dimensions|Normal Value|
+|------|-----------|----------|------------|
+|`druid/heartbeat`| Report service health. For Overlord/Coordinator, the dimension is leader count. `ServiceStatusMonitor` must be enabled. |`heartbeatType`|1|


suneet-s · 2023-06-22T00:06:09Z

docs/configuration/index.md

@@ -399,6 +399,7 @@ Metric monitoring is an essential part of Druid operations.  The following monit
 |`org.apache.druid.server.metrics.TaskCountStatsMonitor`|Reports how many ingestion tasks are currently running/pending/waiting and also the number of successful/failed tasks per emission period.|
 |`org.apache.druid.server.metrics.TaskSlotCountStatsMonitor`|Reports metrics about task slot usage per emission period.|
 |`org.apache.druid.server.metrics.WorkerTaskCountStatsMonitor`|Reports how many ingestion tasks are currently running/pending/waiting, the number of successful/failed tasks, and metrics about task slot usage for the reporting worker, per emission period. Only supported by middleManager node types.|
+| `org.apache.druid.server.metrics.ServiceStatusMonitor`|Reports service heartbeat. For overlord/coordinator, the number is leader count. Only supported by overlord/coordinator node types.|


Stale doc

Suggested change

| `org.apache.druid.server.metrics.ServiceStatusMonitor`|Reports service heartbeat. For overlord/coordinator, the number is leader count. Only supported by overlord/coordinator node types.|

| `org.apache.druid.server.metrics.ServiceStatusMonitor`|Reports a heartbeat for the service. |

server/src/main/java/org/apache/druid/server/metrics/ServiceStatusMonitor.java

server/src/test/java/org/apache/druid/server/metrics/ServiceStatusMonitorTest.java

suneet-s · 2023-06-22T00:14:54Z

server/src/test/java/org/apache/druid/server/metrics/ServiceStatusMonitorTest.java

+    heartbeatTags.put("leader", 1);
+    emitter.flush();
+    monitor.doMonitor(emitter);
+    Assert.assertEquals(1, emitter.getEvents().size());
+    Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("leader"));
+    Assert.assertEquals("druid/heartbeat", emitter.getEvents().get(0).toMap().get("metric"));
+    Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("value"));


I don't think this part of the test is needed. Instead, can you please add a test for adding more than 1 dimension to the metric.

Suggested change

heartbeatTags.put("leader", 1);

emitter.flush();

monitor.doMonitor(emitter);

Assert.assertEquals(1, emitter.getEvents().size());

Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("leader"));

Assert.assertEquals("druid/heartbeat", emitter.getEvents().get(0).toMap().get("metric"));

Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("value"));

And another test for no dimensions in the heartbeatTagsSupplier

kfaraz · 2023-06-22T03:14:13Z

docs/operations/metrics.md

+
+|Metric|Description|Dimensions|Normal Value|
+|------|-----------|----------|------------|
+|`druid/heartbeat`| Report service health. For Overlord/Coordinator, the dimension is leader count. `ServiceStatusMonitor` must be enabled. |`heartbeatType`|1|


Nit: I think we should prefix this metric with either cluster/ or server/ so that this metric becomes cluster/heartbeat or server/heartbeat. We can add other relevant metrics later which have the same prefix.

The prefix druid/ doesn't seem to give any info about the metric.
@suneet-s , @YongGang , what do you think?

I don't have a strong preference. We could also just call the metric heartbeat instead of druid/heartbeat.

The Druid process might be running in a container, in which case server can be misleading. Do you have some examples in mind of other metrics that would live under cluster/ that wouldn't make sense under druid/?

No, I don't have a concrete example in mind either. I was thinking mostly of any cluster-level information, inter-service communication, etc.

I agree with you that server/ can be misleading when running on containers. I avoided using it in a PR due to similar reasons. How about service/ as an alternative? :)

I am working on a PR where I think I am going to use cluster/ for server view syncs. e.g. cluster/serverview/synced which denotes the sync status between coordinator/broker inventory and different historical/peon processes.

druid/ is pretty much a catch-all and any metric that goes under cluster/ could potentially go under druid/ as well. But I generally try to use prefixes that make the metrics a little more user-friendly and somewhat self-explanatory.

service/ prefix seems fine to me as the heartbeat doesn't have to be on cluster level.

suneet-s

Nice! Thanks @YongGang !

suneet-s · 2023-06-22T21:44:37Z

The IntelliJ inspections failure looks legitimate

Error:  server/src/test/java/org/apache/druid/server/metrics/ServiceStatusMonitorTest.java:40 -- The declared exception <code>Exception</code> is never thrown
Error: Process completed with exit code 1.
0s

I have re-triggered the other failing jobs 🤞

YongGang · 2023-06-22T22:18:30Z

The IntelliJ inspections failure looks legitimate
Error:  server/src/test/java/org/apache/druid/server/metrics/ServiceStatusMonitorTest.java:40 -- The declared exception <code>Exception</code> is never thrown
Error: Process completed with exit code 1.
0s
I have re-triggered the other failing jobs 🤞

I removed the Exception declaration (it was added by IntelliJ's auto-generated code).

kfaraz

Thanks for your first PR, @YongGang !

YongGang · 2023-06-25T19:47:20Z

I rebased on master again, hoping to fix the IT (ITNestedQueryPushDownTest) failure.
Although the error says it failed to get the leader, I can't think of a way this PR could have caused it, as the ServiceStatusMonitor has been registered to run.

2023-06-24T04:25:40,322 INFO [main] org.apache.druid.testing.utils.ITRetryUtil - Trying attempt[1/240]...
2023-06-24T04:25:40,342 INFO [main] org.apache.druid.testing.utils.DruidClusterAdminClient - 500 Server Error <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 500 java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Failed  to get current leader for [druid-it-leaderelection-coordinator]</title>
</head>
<body><h2>HTTP ERROR 500 java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Failed  to get current leader for [druid-it-leaderelection-coordinator]</h2>
<table>
<tr><th>URI:</th><td>/druid/coordinator/v1/config</td></tr>
<tr><th>STATUS:</th><td>500</td></tr>
<tr><th>MESSAGE:</th><td>java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Failed  to get current leader for [druid-it-leaderelection-coordinator]</td></tr>
<tr><th>SERVLET:</th><td>org.apache.druid.server.AsyncManagementForwardingServlet-36cdcae0</td></tr>
<tr><th>CAUSED BY:</th><td>java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Failed  to get current leader for [druid-it-leaderelection-coordinator]</td></tr>
<tr><th>CAUSED BY:</th><td>org.apache.druid.java.util.common.RE: Failed  to get current leader for [druid-it-leaderelection-coordinator]</td></tr>
<tr><th>CAUSED BY:</th><td>io.kubernetes.client.openapi.ApiException: Not Found</td></tr>
</table>
<h3>Caused by:</h3><pre>java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Failed  to get current leader for [druid-it-leaderelection-coordinator]
	at org.apache.druid.k8s.discovery.K8sDruidLeaderSelector.getCurrentLeader(K8sDruidLeaderSelector.java:112)
	at org.apache.druid.server.AsyncManagementForwardingServlet.service(AsyncManagementForwardingServlet.java:94)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
	at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1656)
	at org.apache.druid.server.security.PreResponseAuthorizationCheckFilter.doFilter(PreResponseAuthorizationCheckFilter.java:84)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.initialization.jetty.StandardResponseHeaderFilterHolder$StandardResponseHeaderFilter.doFilter(StandardResponseHeaderFilterHolder.java:164)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.security.AllowHttpMethodsResourceFilter.doFilter(AllowHttpMethodsResourceFilter.java:78)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.security.AllowOptionsResourceFilter.doFilter(AllowOptionsResourceFilter.java:74)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.security.AllowAllAuthenticator$1.doFilter(AllowAllAuthenticator.java:84)
	at org.apache.druid.server.security.AuthenticationWrappingFilter.doFilter(AuthenticationWrappingFilter.java:59)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.security.SecuritySanityCheckFilter.doFilter(SecuritySanityCheckFilter.java:77)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:772)
	at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.Server.handle(Server.java:516)
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.druid.java.util.common.RE: Failed  to get current leader for [druid-it-leaderelection-coordinator]
	at org.apache.druid.k8s.discovery.DefaultK8sLeaderElectorFactory$1.getCurrentLeader(DefaultK8sLeaderElectorFactory.java:70)
	at org.apache.druid.k8s.discovery.LeaderElectorAsyncWrapper.getCurrentLeader(LeaderElectorAsyncWrapper.java:117)
	at org.apache.druid.k8s.discovery.K8sDruidLeaderSelector.getCurrentLeader(K8sDruidLeaderSelector.java:109)
	... 53 more
Caused by: io.kubernetes.client.openapi.ApiException: Not Found
	at io.kubernetes.client.openapi.ApiClient.handleResponse(ApiClient.java:993)
	at io.kubernetes.client.openapi.ApiClient.execute(ApiClient.java:905)
	at io.kubernetes.client.openapi.apis.CoreV1Api.readNamespacedConfigMapWithHttpInfo(CoreV1Api.java:45887)
	at io.kubernetes.client.openapi.apis.CoreV1Api.readNamespacedConfigMap(CoreV1Api.java:45857)
	at io.kubernetes.client.extended.leaderelection.resourcelock.ConfigMapLock.get(ConfigMapLock.java:61)
	at org.apache.druid.k8s.discovery.DefaultK8sLeaderElectorFactory$1.getCurrentLeader(DefaultK8sLeaderElectorFactory.java:67)
	... 55 more
</pre>

</body>
</html>

suneet-s · 2023-06-25T22:09:11Z

@YongGang If you run Druid with the bin/druid script, you will see this exception on startup:


1) A binding to com.google.common.base.Supplier<java.util.Map<java.lang.String, java.lang.Object>> annotated with @com.google.inject.name.Named(value="heartbeat") was already configured at org.apache.druid.cli.CliCoordinator$1.getHeartbeatSupplier() (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> org.apache.druid.cli.CliCoordinator$1).
  at org.apache.druid.cli.CliOverlord$1$2.getHeartbeatSupplier(CliOverlord.java:367) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> org.apache.druid.cli.CliOverlord$1 -> org.apache.druid.cli.CliOverlord$1$2)

1 error
        at org.apache.druid.cli.GuiceRunnable.makeInjector(GuiceRunnable.java:88)
        at org.apache.druid.cli.ServerRunnable.run(ServerRunnable.java:62)
        at org.apache.druid.cli.Main.main(Main.java:112)
Caused by: com.google.inject.CreationException: Unable to create injector, see the following errors:

1) A binding to com.google.common.base.Supplier<java.util.Map<java.lang.String, java.lang.Object>> annotated with @com.google.inject.name.Named(value="heartbeat") was already configured at org.apache.druid.cli.CliCoordinator$1.getHeartbeatSupplier() (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> org.apache.druid.cli.CliCoordinator$1).
  at org.apache.druid.cli.CliOverlord$1$2.getHeartbeatSupplier(CliOverlord.java:367) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> org.apache.druid.cli.CliOverlord$1 -> org.apache.druid.cli.CliOverlord$1$2)

...

YongGang · 2023-06-26T04:41:04Z

Thanks @suneet-s, I have now changed it to bind HeartbeatSupplier conditionally in the Coordinator.
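
For readers unfamiliar with the failure above, the idea behind binding conditionally can be sketched without Guice. This is not the actual change in CliCoordinator: `NamedBindings` and `ConditionalHeartbeatBinding` are hypothetical, and the sketch assumes the duplicate arose because a Coordinator that also runs the Overlord in the same process loads the Overlord's modules, which already bind "heartbeat".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical mini-registry that, like the injector error above, rejects a second
// binding for the same name instead of silently overwriting the first.
final class NamedBindings {
  private final Map<String, Supplier<Map<String, Object>>> bindings = new HashMap<>();

  void bind(String name, Supplier<Map<String, Object>> supplier) {
    if (bindings.containsKey(name)) {
      throw new IllegalStateException("A binding named \"" + name + "\" was already configured");
    }
    bindings.put(name, supplier);
  }

  boolean isBound(String name) {
    return bindings.containsKey(name);
  }
}

final class ConditionalHeartbeatBinding {
  // Assumption: in combined mode the Overlord's binding is used and the Coordinator
  // skips its own; otherwise the Coordinator binds the supplier itself.
  static NamedBindings configure(boolean coordinatorAlsoRunsOverlord) {
    NamedBindings bindings = new NamedBindings();
    if (coordinatorAlsoRunsOverlord) {
      // Stands in for the binding contributed by the Overlord's modules.
      bindings.bind("heartbeat", () -> new HashMap<String, Object>());
    } else {
      // Stands in for the Coordinator's own binding.
      bindings.bind("heartbeat", () -> new HashMap<String, Object>());
    }
    return bindings;
  }
}
```

Either way exactly one "heartbeat" binding exists, which is what the injector requires.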

YongGang · 2023-06-26T16:07:43Z

Build succeeded! But I don't have write access, please help merge the PR.

* Add OverlordStatusMonitor and CoordinatorStatusMonitor to monitor service leader status
* make the monitor more general
* resolve conflict
* use Supplier pattern to provide metrics
* reformat code and doc
* move service specific tag to dimension
* minor refine
* update doc
* reformat code
* address comments
* remove declared exception
* bind HeartbeatSupplier conditionally in Coordinator

github-actions bot added the Area - Documentation label Jun 16, 2023

suneet-s added the Area - Metrics/Event Emitting label Jun 16, 2023

suneet-s reviewed Jun 16, 2023

YongGang force-pushed the add-leadership-metrics branch from a4f4574 to 6345857 Compare June 20, 2023 05:25

YongGang changed the title ~~Add OverlordStatusMonitor and CoordinatorStatusMonitor to monitor service leader status~~ Add ServiceStatusMonitor to monitor service health Jun 20, 2023

kfaraz requested changes Jun 21, 2023

suneet-s reviewed Jun 21, 2023

suneet-s reviewed Jun 22, 2023

kfaraz reviewed Jun 22, 2023

suneet-s approved these changes Jun 22, 2023

kfaraz approved these changes Jun 23, 2023

YongGang force-pushed the add-leadership-metrics branch from d8b277f to 99c1b3b Compare June 23, 2023 17:50

YongGang added 11 commits June 25, 2023 12:39

Add OverlordStatusMonitor and CoordinatorStatusMonitor to monitor ser…

04a7c90

…vice leader status

make the monitor more general

cf43ef9

resolve conflict

45a6b8d

use Supplier pattern to provide metrics

4312ce8

reformat code and doc

840181c

move service specific tag to dimension

4a0c442

minor refine

024685a

update doc

23db0aa

reformat code

94e315a

address comments

b1f03a0

remove declared exception

3b939c9

YongGang force-pushed the add-leadership-metrics branch from 99c1b3b to 3b939c9 Compare June 25, 2023 19:39

bind HeartbeatSupplier conditionally in Coordinator

95e19e8

suneet-s merged commit b7434be into apache:master Jun 26, 2023

YongGang deleted the add-leadership-metrics branch July 10, 2023 21:54

YongGang mentioned this pull request Jul 10, 2023

Add service/heartbeat metric into statsd-reporter #14564

Merged

10 tasks

abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023

This was referenced Aug 3, 2023

Additional dimensions for service/heartbeat #14743

Merged

Enable ServiceStatusMonitor in the examples #14744

Merged

AmatyaAvadhanula mentioned this pull request Aug 6, 2023

[DRAFT] 27.0.0 release notes #14761

Closed

docs/operations/metrics.md

-|`druid/heartbeat`| Report service health. For Overlord/Coordinator, the dimension is leader count. `ServiceStatusMonitor` must be enabled. |`heartbeatType`|1|
+|`druid/heartbeat`| Metric indicating the service is up. `ServiceStatusMonitor` must be enabled. |`leader` on the Overlord and Coordinator.|1|

docs/configuration/index.md

-| `org.apache.druid.server.metrics.ServiceStatusMonitor`|Reports service heartbeat. For overlord/coordinator, the number is leader count. Only supported by overlord/coordinator node types.|
+| `org.apache.druid.server.metrics.ServiceStatusMonitor`|Reports a heartbeat for the service. |
