
Add ServiceStatusMonitor to monitor service health #14443

Merged 12 commits into apache:master on Jun 26, 2023

Conversation

Contributor

@YongGang YongGang commented Jun 16, 2023

Description

There are cases where no leader exists for the Overlord/Coordinator, and cases where multiple leaders are selected. This can happen when a service is overloaded or a network partition occurs.

This PR adds a general heartbeat metric to indicate service health. For the Overlord/Coordinator, the sum of the druid/heartbeat metric with the leader=1 dimension should always be one in a healthy Druid cluster.
The new metric example for Coordinator:
{"feed":"metrics","leader":1,"metric":"druid/heartbeat","service":"coordinator","host":"localhost:8081","version":"","value":1,"timestamp":"2023-06-21T21:55:53.216Z"}
For Overlord:
{"feed":"metrics","leader":1,"metric":"druid/heartbeat","service":"overlord","host":"localhost:8090","version":"","value":1,"timestamp":"2023-06-21T21:34:01.764Z"}
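As a rough illustration of how an operator might consume these events, the sketch below sums leader heartbeats per service and flags anything other than exactly one leader. `LeaderHeartbeatCheck` and its methods are hypothetical names, not part of Druid. The sketch assumes the events have already been parsed into maps, that they all come from one emission period, and that non-leaders either report `leader=0` or omit the dimension.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Druid): checks leader heartbeats per service.
final class LeaderHeartbeatCheck {

    // Sums the values of druid/heartbeat events tagged leader=1, grouped by service.
    static Map<String, Integer> leaderCounts(List<Map<String, Object>> events) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, Object> event : events) {
            Object leader = event.get("leader");
            Object value = event.get("value");
            boolean isLeaderHeartbeat = "druid/heartbeat".equals(event.get("metric"))
                && leader instanceof Number
                && ((Number) leader).intValue() == 1
                && value instanceof Number;
            if (isLeaderHeartbeat) {
                counts.merge((String) event.get("service"), ((Number) value).intValue(), Integer::sum);
            }
        }
        return counts;
    }

    // Zero means no leader was seen; more than one suggests multiple leaders.
    static boolean hasExactlyOneLeader(List<Map<String, Object>> events, String service) {
        return leaderCounts(events).getOrDefault(service, 0) == 1;
    }
}
```

A sum of 0 for a service corresponds to the "no leader" case in the description, and a sum above 1 to the "multiple leaders" case.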

Other services need to provide the following supplier to report the druid/heartbeat metric:

@Named("heartbeat")
Supplier<Map<String, Object>> heartbeatTagsSupplier
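A minimal sketch of what such a supplier could look like, assuming the tags should be recomputed on every call so that the `leader` flag reflects the state at emission time. It uses `java.util.function.Supplier` as a stand-in so that it compiles without dependencies (the Guice error later in this thread shows the real binding uses Guava's `com.google.common.base.Supplier`). `HeartbeatTags`, `leaderTags`, and the `isLeader` callback are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical factory for heartbeat tags; not the code from this PR.
final class HeartbeatTags {

    // Returns a supplier that builds a fresh tag map on each call,
    // so a leadership change is picked up by the next heartbeat.
    static Supplier<Map<String, Object>> leaderTags(BooleanSupplier isLeader) {
        return () -> {
            Map<String, Object> tags = new HashMap<>();
            tags.put("leader", isLeader.getAsBoolean() ? 1 : 0);
            return tags;
        };
    }
}
```

In a real service the supplier would be bound with Guice under `@Named("heartbeat")`, and the leadership check would come from whatever leader-election component the service already uses.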

Release note

Add a new Monitor to monitor the health of the Overlord and Coordinator services.


Key changed/added classes in this PR
  • add ServiceStatusMonitor

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@suneet-s
Contributor

Thanks for your first contribution to Druid @YongGang !

Instead of introducing one monitor per service, I'd recommend creating a single monitor that can be installed on every service. This will make it easier to use as an operator can set this monitor in the common runtime properties instead of having to configure it per service.

Instead of the leader/count metric, what do you think of introducing a druid/heartbeat metric that can be annotated with dimensions like leader for the coordinator / overlord, or task_id for the peons, or disabled for middle managers?

We do not need to implement all these ideas in this PR, but I think a heartbeat metric will be more flexible than a metric that is scoped to the leader.

public boolean doMonitor(ServiceEmitter emitter) {
final ServiceMetricEvent.Builder builder = new ServiceMetricEvent.Builder();

builder.setDimension("serviceType", "overlord");
Contributor

@suneet-s suneet-s Jun 16, 2023

Since the monitor is using the ServiceEmitter, the service and host will be added auto-magically :) So this is not needed.

@YongGang
Contributor Author

Thanks for the review @suneet-s .
This is a good idea, I updated the PR to make the metric more general.
For Peon and MM, I haven't found a good way to report a health status that aligns with what we do here, though.

@YongGang YongGang changed the title from "Add OverlordStatusMonitor and CoordinatorStatusMonitor to monitor service leader status" to "Add ServiceStatusMonitor to monitor service health" Jun 20, 2023
@suneet-s
Contributor

> For Peon and MM, I haven't found a good way to report a health status that aligns with what we do here, though.

My recommendation would be to remove the ServiceStatusProvider that is currently implemented and replace it with a named, injected @Named("heartbeat") Supplier<Map<String, Object>> heartbeatTags. Then the CliCoordinator and other services can inject the Supplier based on what each service deems a useful annotation.

@YongGang
Contributor Author

Updated to use the Supplier pattern; tested locally and saw the correct output.

|`org.apache.druid.server.metrics.TaskCountStatsMonitor`|Reports how many ingestion tasks are currently running/pending/waiting and also the number of successful/failed tasks per emission period.|
|`org.apache.druid.server.metrics.TaskSlotCountStatsMonitor`|Reports metrics about task slot usage per emission period.|
|`org.apache.druid.server.metrics.WorkerTaskCountStatsMonitor`|Reports how many ingestion tasks are currently running/pending/waiting, the number of successful/failed tasks, and metrics about task slot usage for the reporting worker, per emission period. Only supported by middleManager node types.|
| Name | Description |
Contributor

I think the original formatting aligns better with the rest of the Druid docs.

@@ -326,6 +326,12 @@ If `emitBalancingStats` is set to `true` in the Coordinator [dynamic configurati

## General Health

### Overlord/Coordinator

| Metric | Description | Dimensions | Normal Value |
Contributor

Please use the formatting style used in the rest of the Druid docs.

Contributor Author

Updated. IntelliJ keeps reformatting the doc, interesting.

return true;
}

heartbeatTagsSupplier.get().forEach((k, v) -> {
Contributor

If I am not mistaken, the tags should each be a separate dimension and not be emitted as metric values.
The metric value will always be 1, as it is a simple count.

In the current code, you would be emitting the druid/heartbeat metric multiple times in every invocation of doMonitor.

Contributor Author

Not sure I understand.
For the Overlord/Coordinator, this druid/heartbeat metric will be 1 with the heartbeatType dimension set to leader. Since the Map<String, Number> has only one entry, the metric is reported only once per doMonitor call.
For other potential services (or components within a service) that use this monitor, the druid/heartbeat metric doesn't have to be 1. That's why the heartbeatType dimension is introduced: the metric can have a different meaning for each heartbeatType.

Contributor

I think what Kashif is mentioning here is that heartbeatTagsSupplier should be a map of dimension keys to values and the metric that is reported for the heartbeat is always a constant.

Contributor Author

Thanks, updated the code with dimensions configured.

Assert.assertEquals("druid/heartbeat", emitter.getEvents().get(0).toMap().get("metric"));
Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("value"));
}
}
Contributor

Nit: newline at end of file.

Contributor Author

Added

return true;
}

heartbeatTagsSupplier.get().forEach((k, v) -> {
Contributor

Please add a null check on the supplier, just in case someone wants to use this monitor for other services, where the tags supplier is not being injected.

Contributor Author

Updated.

Contributor

To make the heartbeat work on other services, I think the pattern in this monitor should be

ServiceMetricEvent.Builder builder = ...;
if (!heartbeatDimensions.isEmpty()) {
  heartbeatDimensions.forEach((k, v) -> builder.setDimension(k, v));
}
emitter.emit(builder.build("druid/heartbeat", 1));
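The pattern suggested here, combined with the null check requested earlier in this thread, can be sketched without any Druid dependencies as follows. `RecordingEmitter` and `HeartbeatMonitorSketch` are hypothetical stand-ins for `ServiceEmitter` and the monitor. Events are modeled as plain maps, which is an assumption made only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Stand-in for an emitter: records each emitted event as a flat map.
final class RecordingEmitter {
    final List<Map<String, Object>> events = new ArrayList<>();

    void emit(Map<String, Object> event) {
        events.add(event);
    }
}

// Hypothetical monitor: emits one heartbeat per call, with tags as dimensions.
final class HeartbeatMonitorSketch {
    // May be null when a service does not provide heartbeat tags.
    private final Supplier<Map<String, Object>> tagsSupplier;

    HeartbeatMonitorSketch(Supplier<Map<String, Object>> tagsSupplier) {
        this.tagsSupplier = tagsSupplier;
    }

    boolean doMonitor(RecordingEmitter emitter) {
        Map<String, Object> event = new LinkedHashMap<>();
        Map<String, Object> tags = tagsSupplier == null ? null : tagsSupplier.get();
        if (tags != null) {
            // Every tag becomes a dimension on the single heartbeat event.
            event.putAll(tags);
        }
        // Set last so a tag cannot overwrite the metric name or the constant value.
        event.put("metric", "druid/heartbeat");
        event.put("value", 1);
        emitter.emit(event);
        return true;
    }
}
```

With this shape, a service with no tags still emits a heartbeat, and a service with several tags emits exactly one event carrying all of them.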

Supplier<Map<String, Number>> heartbeatTagsSupplier = null;

@Inject
public ServiceStatusMonitor() {
Contributor

Is the empty constructor needed?

Contributor Author

Removed

@@ -326,6 +326,12 @@ If `emitBalancingStats` is set to `true` in the Coordinator [dynamic configurati

## General Health

### Overlord/Coordinator
Contributor

I think it can be used for all services, even though it might not be very useful right now except for coordinator and overlord.

Contributor Author

Updated.

Contributor

@suneet-s suneet-s left a comment

I think we're closer now.

Can you please update the description of the PR to reflect the new design and include examples of what the metrics look like once the feedback is incorporated.


@YongGang
Contributor Author

Updated the PR description to reflect the new design.

Contributor

@suneet-s suneet-s left a comment

Looks good once the docs + tests are updated!


|Metric|Description|Dimensions|Normal Value|
|------|-----------|----------|------------|
|`druid/heartbeat`| Report service health. For Overlord/Coordinator, the dimension is leader count. `ServiceStatusMonitor` must be enabled. |`heartbeatType`|1|
Contributor

Stale doc.

Suggested change
- |`druid/heartbeat`| Report service health. For Overlord/Coordinator, the dimension is leader count. `ServiceStatusMonitor` must be enabled. |`heartbeatType`|1|
+ |`druid/heartbeat`| Metric indicating the service is up. `ServiceStatusMonitor` must be enabled. |`leader` on the Overlord and Coordinator.|1|

@@ -399,6 +399,7 @@ Metric monitoring is an essential part of Druid operations. The following monit
|`org.apache.druid.server.metrics.TaskCountStatsMonitor`|Reports how many ingestion tasks are currently running/pending/waiting and also the number of successful/failed tasks per emission period.|
|`org.apache.druid.server.metrics.TaskSlotCountStatsMonitor`|Reports metrics about task slot usage per emission period.|
|`org.apache.druid.server.metrics.WorkerTaskCountStatsMonitor`|Reports how many ingestion tasks are currently running/pending/waiting, the number of successful/failed tasks, and metrics about task slot usage for the reporting worker, per emission period. Only supported by middleManager node types.|
| `org.apache.druid.server.metrics.ServiceStatusMonitor`|Reports service heartbeat. For overlord/coordinator, the number is leader count. Only supported by overlord/coordinator node types.|
Contributor

Stale doc

Suggested change
- | `org.apache.druid.server.metrics.ServiceStatusMonitor`|Reports service heartbeat. For overlord/coordinator, the number is leader count. Only supported by overlord/coordinator node types.|
+ | `org.apache.druid.server.metrics.ServiceStatusMonitor`|Reports a heartbeat for the service. |

Comment on lines 57 to 80
heartbeatTags.put("leader", 1);
emitter.flush();
monitor.doMonitor(emitter);
Assert.assertEquals(1, emitter.getEvents().size());
Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("leader"));
Assert.assertEquals("druid/heartbeat", emitter.getEvents().get(0).toMap().get("metric"));
Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("value"));
Contributor

I don't think this part of the test is needed. Instead, can you please add a test for adding more than 1 dimension to the metric.

Suggested change
- heartbeatTags.put("leader", 1);
- emitter.flush();
- monitor.doMonitor(emitter);
- Assert.assertEquals(1, emitter.getEvents().size());
- Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("leader"));
- Assert.assertEquals("druid/heartbeat", emitter.getEvents().get(0).toMap().get("metric"));
- Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("value"));

And another test for no dimensions in the heartbeatTagsSupplier


|Metric|Description|Dimensions|Normal Value|
|------|-----------|----------|------------|
|`druid/heartbeat`| Report service health. For Overlord/Coordinator, the dimension is leader count. `ServiceStatusMonitor` must be enabled. |`heartbeatType`|1|
Contributor

Nit: I think we should prefix this metric with either cluster/ or server/ so that this metric becomes cluster/heartbeat or server/heartbeat. We can add other relevant metrics later which have the same prefix.

The prefix druid/ doesn't seem to give any info about the metric.
@suneet-s , @YongGang , what do you think?

Contributor

I don't have a strong preference. We could also just call the metric heartbeat instead of druid/heartbeat.

The Druid process might be running in a container, in which case server can be misleading. Do you have some examples in mind of other metrics that would live under cluster/ that wouldn't make sense under druid/?

Contributor

@kfaraz kfaraz Jun 22, 2023

No, I don't have a concrete example in mind either. I was thinking mostly of any cluster-level information, inter-service communication, etc.

I agree with you that server/ can be misleading when running on containers. I avoided using it in a PR due to similar reasons. How about service/ as an alternative? :)

I am working on a PR where I think I am going to use cluster/ for server view syncs. e.g. cluster/serverview/synced which denotes the sync status between coordinator/broker inventory and different historical/peon processes.

druid/ is pretty much a catch-all and any metric that goes under cluster/ could potentially go under druid/ as well. But I generally try to use prefixes that make the metrics a little more user-friendly and somewhat self-explanatory.

Contributor Author

service/ prefix seems fine to me as the heartbeat doesn't have to be on cluster level.

Contributor

@suneet-s suneet-s left a comment

Nice! Thanks @YongGang !

@suneet-s
Contributor

The IntelliJ inspections failure looks legitimate:

Error:  server/src/test/java/org/apache/druid/server/metrics/ServiceStatusMonitorTest.java:40 -- The declared exception `Exception` is never thrown
Error: Process completed with exit code 1.

I have re-triggered the other failing jobs 🤞

@YongGang
Contributor Author

> The IntelliJ inspections failure looks legitimate:
>
> Error:  server/src/test/java/org/apache/druid/server/metrics/ServiceStatusMonitorTest.java:40 -- The declared exception `Exception` is never thrown
> Error: Process completed with exit code 1.
>
> I have re-triggered the other failing jobs 🤞

I removed the Exception declaration (it was added by IntelliJ's auto-generated code).

Contributor

@kfaraz kfaraz left a comment

Thanks for your first PR, @YongGang !

@YongGang
Contributor Author

I rebased onto master again, hoping to fix the IT (ITNestedQueryPushDownTest) failure.
Although the error says it failed to get the leader, I can't think of a way this PR could have caused it, as the ServiceStatusMonitor has been registered to run.

2023-06-24T04:25:40,322 INFO [main] org.apache.druid.testing.utils.ITRetryUtil - Trying attempt[1/240]...
2023-06-24T04:25:40,342 INFO [main] org.apache.druid.testing.utils.DruidClusterAdminClient - 500 Server Error <html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
<title>Error 500 java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Failed  to get current leader for [druid-it-leaderelection-coordinator]</title>
</head>
<body><h2>HTTP ERROR 500 java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Failed  to get current leader for [druid-it-leaderelection-coordinator]</h2>
<table>
<tr><th>URI:</th><td>/druid/coordinator/v1/config</td></tr>
<tr><th>STATUS:</th><td>500</td></tr>
<tr><th>MESSAGE:</th><td>java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Failed  to get current leader for [druid-it-leaderelection-coordinator]</td></tr>
<tr><th>SERVLET:</th><td>org.apache.druid.server.AsyncManagementForwardingServlet-36cdcae0</td></tr>
<tr><th>CAUSED BY:</th><td>java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Failed  to get current leader for [druid-it-leaderelection-coordinator]</td></tr>
<tr><th>CAUSED BY:</th><td>org.apache.druid.java.util.common.RE: Failed  to get current leader for [druid-it-leaderelection-coordinator]</td></tr>
<tr><th>CAUSED BY:</th><td>io.kubernetes.client.openapi.ApiException: Not Found</td></tr>
</table>
<h3>Caused by:</h3><pre>java.lang.RuntimeException: org.apache.druid.java.util.common.RE: Failed  to get current leader for [druid-it-leaderelection-coordinator]
	at org.apache.druid.k8s.discovery.K8sDruidLeaderSelector.getCurrentLeader(K8sDruidLeaderSelector.java:112)
	at org.apache.druid.server.AsyncManagementForwardingServlet.service(AsyncManagementForwardingServlet.java:94)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
	at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1656)
	at org.apache.druid.server.security.PreResponseAuthorizationCheckFilter.doFilter(PreResponseAuthorizationCheckFilter.java:84)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.initialization.jetty.StandardResponseHeaderFilterHolder$StandardResponseHeaderFilter.doFilter(StandardResponseHeaderFilterHolder.java:164)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.security.AllowHttpMethodsResourceFilter.doFilter(AllowHttpMethodsResourceFilter.java:78)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.security.AllowOptionsResourceFilter.doFilter(AllowOptionsResourceFilter.java:74)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.security.AllowAllAuthenticator$1.doFilter(AllowAllAuthenticator.java:84)
	at org.apache.druid.server.security.AuthenticationWrappingFilter.doFilter(AuthenticationWrappingFilter.java:59)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.apache.druid.server.security.SecuritySanityCheckFilter.doFilter(SecuritySanityCheckFilter.java:77)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1626)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:552)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:505)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:772)
	at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:59)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.Server.handle(Server.java:516)
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.druid.java.util.common.RE: Failed  to get current leader for [druid-it-leaderelection-coordinator]
	at org.apache.druid.k8s.discovery.DefaultK8sLeaderElectorFactory$1.getCurrentLeader(DefaultK8sLeaderElectorFactory.java:70)
	at org.apache.druid.k8s.discovery.LeaderElectorAsyncWrapper.getCurrentLeader(LeaderElectorAsyncWrapper.java:117)
	at org.apache.druid.k8s.discovery.K8sDruidLeaderSelector.getCurrentLeader(K8sDruidLeaderSelector.java:109)
	... 53 more
Caused by: io.kubernetes.client.openapi.ApiException: Not Found
	at io.kubernetes.client.openapi.ApiClient.handleResponse(ApiClient.java:993)
	at io.kubernetes.client.openapi.ApiClient.execute(ApiClient.java:905)
	at io.kubernetes.client.openapi.apis.CoreV1Api.readNamespacedConfigMapWithHttpInfo(CoreV1Api.java:45887)
	at io.kubernetes.client.openapi.apis.CoreV1Api.readNamespacedConfigMap(CoreV1Api.java:45857)
	at io.kubernetes.client.extended.leaderelection.resourcelock.ConfigMapLock.get(ConfigMapLock.java:61)
	at org.apache.druid.k8s.discovery.DefaultK8sLeaderElectorFactory$1.getCurrentLeader(DefaultK8sLeaderElectorFactory.java:67)
	... 55 more
</pre>

</body>
</html>

@suneet-s
Contributor

@YongGang If you run Druid with the bin/druid script, you will see this exception on startup:


1) A binding to com.google.common.base.Supplier<java.util.Map<java.lang.String, java.lang.Object>> annotated with @com.google.inject.name.Named(value="heartbeat") was already configured at org.apache.druid.cli.CliCoordinator$1.getHeartbeatSupplier() (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> org.apache.druid.cli.CliCoordinator$1).
  at org.apache.druid.cli.CliOverlord$1$2.getHeartbeatSupplier(CliOverlord.java:367) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> org.apache.druid.cli.CliOverlord$1 -> org.apache.druid.cli.CliOverlord$1$2)

1 error
        at org.apache.druid.cli.GuiceRunnable.makeInjector(GuiceRunnable.java:88)
        at org.apache.druid.cli.ServerRunnable.run(ServerRunnable.java:62)
        at org.apache.druid.cli.Main.main(Main.java:112)
Caused by: com.google.inject.CreationException: Unable to create injector, see the following errors:

1) A binding to com.google.common.base.Supplier<java.util.Map<java.lang.String, java.lang.Object>> annotated with @com.google.inject.name.Named(value="heartbeat") was already configured at org.apache.druid.cli.CliCoordinator$1.getHeartbeatSupplier() (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> org.apache.druid.cli.CliCoordinator$1).
  at org.apache.druid.cli.CliOverlord$1$2.getHeartbeatSupplier(CliOverlord.java:367) (via modules: com.google.inject.util.Modules$OverrideModule -> com.google.inject.util.Modules$OverrideModule -> org.apache.druid.cli.CliOverlord$1 -> org.apache.druid.cli.CliOverlord$1$2)

...

@YongGang
Contributor Author

Thanks @suneet-s, I have now changed it to bind HeartbeatSupplier conditionally in the Coordinator.
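To illustrate why a conditional binding avoids the duplicate-binding error above, here is a toy model that does not use Guice. `ToyBinder` and `ToyModules` are hypothetical, and the `alsoRunsOverlord` flag is an assumption about the condition: the thread only says the binding became conditional in the Coordinator, while the error shows both the Coordinator and Overlord modules binding the same `@Named("heartbeat")` key in one process.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Toy stand-in for an injector's binding table; rejects duplicate keys.
final class ToyBinder {
    private final Map<String, Supplier<Map<String, Object>>> bindings = new HashMap<>();

    void bind(String name, Supplier<Map<String, Object>> supplier) {
        if (bindings.putIfAbsent(name, supplier) != null) {
            throw new IllegalStateException(
                "A binding annotated with @Named(\"" + name + "\") was already configured");
        }
    }

    boolean isBound(String name) {
        return bindings.containsKey(name);
    }
}

// Hypothetical module setup mirroring the situation in the error message.
final class ToyModules {
    static void configureCoordinator(ToyBinder binder, boolean alsoRunsOverlord) {
        // Assumed condition: skip the binding when the Overlord module
        // in the same process will provide it.
        if (!alsoRunsOverlord) {
            binder.bind("heartbeat", () -> Map.of("leader", 1));
        }
    }

    static void configureOverlord(ToyBinder binder) {
        binder.bind("heartbeat", () -> Map.of("leader", 1));
    }
}
```

When both modules are installed into the same binder, only one of them registers the key, so injector creation no longer fails.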

@YongGang
Contributor Author

Build succeeded! But I don't have write access, please help merge the PR.

@suneet-s suneet-s merged commit b7434be into apache:master Jun 26, 2023
@YongGang YongGang deleted the add-leadership-metrics branch July 10, 2023 21:54
@abhishekagarwal87 abhishekagarwal87 added this to the 27.0 milestone Jul 19, 2023
sergioferragut pushed a commit to sergioferragut/druid that referenced this pull request Jul 21, 2023
* Add OverlordStatusMonitor and CoordinatorStatusMonitor to monitor service leader status

* make the monitor more general

* resolve conflict

* use Supplier pattern to provide metrics

* reformat code and doc

* move service specific tag to dimension

* minor refine

* update doc

* reformat code

* address comments

* remove declared exception

* bind HeartbeatSupplier conditionally in Coordinator