-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ServiceStatusMonitor to monitor service health #14443
Conversation
Thanks for your first contribution to Druid @YongGang ! Instead of introducing one monitor per service, I'd recommend creating a single monitor that can be installed on every service. This will make it easier to use as an operator can set this monitor in the common runtime properties instead of having to configure it per service. Instead of the We do not need to implement all these ideas in this PR, but I think a heartbeat metric will be more flexible than a metric that is scoped to the leader. |
public boolean doMonitor(ServiceEmitter emitter) { | ||
final ServiceMetricEvent.Builder builder = new ServiceMetricEvent.Builder(); | ||
|
||
builder.setDimension("serviceType", "overlord"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the monitor is using the ServiceEmitter, the service and host will be added auto-magically :) So this is not needed.
Thanks for the review @suneet-s . |
a4f4574
to
6345857
Compare
My recommendation would be to remove the |
Updated to use |
docs/configuration/index.md
Outdated
|`org.apache.druid.server.metrics.TaskCountStatsMonitor`|Reports how many ingestion tasks are currently running/pending/waiting and also the number of successful/failed tasks per emission period.| | ||
|`org.apache.druid.server.metrics.TaskSlotCountStatsMonitor`|Reports metrics about task slot usage per emission period.| | ||
|`org.apache.druid.server.metrics.WorkerTaskCountStatsMonitor`|Reports how many ingestion tasks are currently running/pending/waiting, the number of successful/failed tasks, and metrics about task slot usage for the reporting worker, per emission period. Only supported by middleManager node types.| | ||
| Name | Description | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the original formatting aligns better with the rest of the Druid docs.
docs/operations/metrics.md
Outdated
@@ -326,6 +326,12 @@ If `emitBalancingStats` is set to `true` in the Coordinator [dynamic configurati | |||
|
|||
## General Health | |||
|
|||
### Overlord/Coordinator | |||
|
|||
| Metric | Description | Dimensions | Normal Value | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the formatting style used in the rest of the Druid docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. Intellij keep reformatting the doc, interesting.
return true; | ||
} | ||
|
||
heartbeatTagsSupplier.get().forEach((k, v) -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I am not mistaken, the tags should each be a separate dimension and not be emitted as metric values.
The metric value will always be 1, as it is a simple count.
In the current code, you would be emitting the druid/heartbeat
metric multiple times in every invocation of doMonitor
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand.
For Overlord/Coordinator, this druid/heartbeat
metric will be 1 with heartbeatType
dimension set to leader
. And since this map Map<String, Number>
only have one entry, so the metric only reported once per doMonitor
call.
For other potential service (or component within the service) to use this monitor, druid/heartbeat
metric doesn't have to be 1, that's why heartbeatType
dimension is introduced as the metric can have different meaning for different heartbeatType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what Kashif is mentioning here is that heartbeatTagsSupplier
should be a map of dimension keys to values and the metric that is reported for the heartbeat is always a constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated the code with dimensions configured.
Assert.assertEquals("druid/heartbeat", emitter.getEvents().get(0).toMap().get("metric")); | ||
Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("value")); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: newline at end of file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
return true; | ||
} | ||
|
||
heartbeatTagsSupplier.get().forEach((k, v) -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a null check on the supplier, just in case someone wants to use this monitor for other services, where the tags supplier is not being injected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make the heartbeat work on other services, I think the pattern in this monitor should be
ServiceMetricEventBuilder builder = ...;
if (heartbeatDimensions is not empty) {
heartbeatDimensions.forEach(builder.setDimension(k, v);
}
emitter.emit(builder.build("druid/heartbeat", 1);
Supplier<Map<String, Number>> heartbeatTagsSupplier = null; | ||
|
||
@Inject | ||
public ServiceStatusMonitor() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the empty constructor needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
docs/operations/metrics.md
Outdated
@@ -326,6 +326,12 @@ If `emitBalancingStats` is set to `true` in the Coordinator [dynamic configurati | |||
|
|||
## General Health | |||
|
|||
### Overlord/Coordinator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it can be used for all services, even though it might not be very useful right now except for coordinator and overlord.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're closer now.
Can you please update the description of the PR to reflect the new design and include examples of what the metrics look like once the feedback is incorporated.
return true; | ||
} | ||
|
||
heartbeatTagsSupplier.get().forEach((k, v) -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what Kashif is mentioning here is that heartbeatTagsSupplier
should be a map of dimension keys to values and the metric that is reported for the heartbeat is always a constant.
return true; | ||
} | ||
|
||
heartbeatTagsSupplier.get().forEach((k, v) -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make the heartbeat work on other services, I think the pattern in this monitor should be
ServiceMetricEventBuilder builder = ...;
if (heartbeatDimensions is not empty) {
heartbeatDimensions.forEach(builder.setDimension(k, v);
}
emitter.emit(builder.build("druid/heartbeat", 1);
Updated the PR description to reflect the new design. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good once the docs + tests are updated!
docs/operations/metrics.md
Outdated
|
||
|Metric|Description|Dimensions|Normal Value| | ||
|------|-----------|----------|------------| | ||
|`druid/heartbeat`| Report service health. For Overlord/Coordinator, the dimension is leader count. `ServiceStatusMonitor` must be enabled. |`heartbeatType`|1| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stale doc.
|`druid/heartbeat`| Report service health. For Overlord/Coordinator, the dimension is leader count. `ServiceStatusMonitor` must be enabled. |`heartbeatType`|1| | |
|`druid/heartbeat`| Metric indicating the service is up. `ServiceStatusMonitor` must be enabled. |`leader` on the Overlord and Coordinator.|1| |
docs/configuration/index.md
Outdated
@@ -399,6 +399,7 @@ Metric monitoring is an essential part of Druid operations. The following monit | |||
|`org.apache.druid.server.metrics.TaskCountStatsMonitor`|Reports how many ingestion tasks are currently running/pending/waiting and also the number of successful/failed tasks per emission period.| | |||
|`org.apache.druid.server.metrics.TaskSlotCountStatsMonitor`|Reports metrics about task slot usage per emission period.| | |||
|`org.apache.druid.server.metrics.WorkerTaskCountStatsMonitor`|Reports how many ingestion tasks are currently running/pending/waiting, the number of successful/failed tasks, and metrics about task slot usage for the reporting worker, per emission period. Only supported by middleManager node types.| | |||
| `org.apache.druid.server.metrics.ServiceStatusMonitor`|Reports service heartbeat. For overlord/coordinator, the number is leader count. Only supported by overlord/coordinator node types.| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stale doc
| `org.apache.druid.server.metrics.ServiceStatusMonitor`|Reports service heartbeat. For overlord/coordinator, the number is leader count. Only supported by overlord/coordinator node types.| | |
| `org.apache.druid.server.metrics.ServiceStatusMonitor`|Reports a heartbeat for the service. | |
server/src/main/java/org/apache/druid/server/metrics/ServiceStatusMonitor.java
Show resolved
Hide resolved
server/src/test/java/org/apache/druid/server/metrics/ServiceStatusMonitorTest.java
Show resolved
Hide resolved
heartbeatTags.put("leader", 1); | ||
emitter.flush(); | ||
monitor.doMonitor(emitter); | ||
Assert.assertEquals(1, emitter.getEvents().size()); | ||
Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("leader")); | ||
Assert.assertEquals("druid/heartbeat", emitter.getEvents().get(0).toMap().get("metric")); | ||
Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("value")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this part of the test is needed. Instead, can you please add a test for adding more than 1 dimension to the metric.
heartbeatTags.put("leader", 1); | |
emitter.flush(); | |
monitor.doMonitor(emitter); | |
Assert.assertEquals(1, emitter.getEvents().size()); | |
Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("leader")); | |
Assert.assertEquals("druid/heartbeat", emitter.getEvents().get(0).toMap().get("metric")); | |
Assert.assertEquals(1, emitter.getEvents().get(0).toMap().get("value")); |
And another test for no dimensions in the heartbeatTagsSupplier
docs/operations/metrics.md
Outdated
|
||
|Metric|Description|Dimensions|Normal Value| | ||
|------|-----------|----------|------------| | ||
|`druid/heartbeat`| Report service health. For Overlord/Coordinator, the dimension is leader count. `ServiceStatusMonitor` must be enabled. |`heartbeatType`|1| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think we should prefix this metric with either cluster/
or server/
so that this metric becomes cluster/heartbeat
or server/heartbeat
. We can add other relevant metrics later which have the same prefix.
The prefix druid/
doesn't seem to give any info about the metric.
@suneet-s , @YongGang , what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong preference. We could also just call the metric heartbeat
instead of druid/hearbeat
The Druid process might be running in a container, in which case server
can be misleading. Do you have some examples in mind of other metrics that would live under cluster/
that wouldn't make sense under druid/
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't have a concrete example in mind either. I was thinking mostly of any cluster-level information, inter-service communication, etc.
I agree with you that server/
can be misleading when running on containers. I avoided using it in a PR due to similar reasons. How about service/
as an alternative? :)
I am working on a PR where I think I am going to use cluster/
for server view syncs. e.g. cluster/serverview/synced
which denotes the sync status between coordinator/broker inventory and different historical/peon processes.
druid/
is pretty much a catch-all and any metric that goes under cluster/
could potentially go under druid/
as well. But I generally try to use prefixes that make the metrics a little more user-friendly and somewhat self-explanatory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
service/
prefix seems fine to me as the heartbeat doesn't have to be on cluster
level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thanks @YongGang !
The intelliJ inspections failure looks legitimate
I have re-triggered the other failing jobs 🤞 |
I removed the Exception declaration (was added by IntelliJ auto gen code). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your first PR, @YongGang !
d8b277f
to
99c1b3b
Compare
…vice leader status
99c1b3b
to
3b939c9
Compare
I rebased from master again hope to fix the IT (ITNestedQueryPushDownTest) failure.
|
@YongGang If you run Druid with the
|
Thanks @suneet-s , now changed to bind HeartbeatSupplier conditionally in Coordinator |
Build succeeded! But I don't have write access, please help merge the PR. |
* Add OverlordStatusMonitor and CoordinatorStatusMonitor to monitor service leader status * make the monitor more general * resolve conflict * use Supplier pattern to provide metrics * reformat code and doc * move service specific tag to dimension * minor refine * update doc * reformat code * address comments * remove declared exception * bind HeartbeatSupplier conditionally in Coordinator
Description
There are cases that no leader for Overlord/Coordinator and cases that multiple leaders selected, this could happen when service is over loaded or network partition happened.
This PR adds a general heartbeat metric to indicate service health. For Overlord/Coordinator, the sum of
druid/heartbeat
metric withleader=1
dimension should always be one in a heathy Druid cluster.The new metric example for Coordinator:
{"feed":"metrics","leader":1,"metric":"druid/heartbeat","service":"coordinator","host":"localhost:8081","version":"","value":1,"timestamp":"2023-06-21T21:55:53.216Z"}
For Overlord:
{"feed":"metrics","leader":1,"metric":"druid/heartbeat","service":"overlord","host":"localhost:8090","version":"","value":1,"timestamp":"2023-06-21T21:34:01.764Z"}
Other service need to provide following supplier to report
druid/heartbeat
metric:Release note
Add a new Monitor to monitor the health of Overlord and Coordinator service.
Key changed/added classes in this PR
ServiceStatusMonitor
This PR has: