-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add segment handoff time metric #13238
Add segment handoff time metric #13238
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes, @AmatyaAvadhanula .
I have left some comments.
core/src/main/java/org/apache/druid/java/util/metrics/Monitor.java
Outdated
Show resolved
Hide resolved
...ice/src/test/java/org/apache/druid/indexing/common/stats/TaskRealtimeMetricsMonitorTest.java
Show resolved
Hide resolved
server/src/test/java/org/apache/druid/segment/realtime/RealtimeMetricsMonitorTest.java
Show resolved
Hide resolved
@Override | ||
public void stopAfterLastRoundOfMetricsEmission(ServiceEmitter emitter) | ||
{ | ||
monitors.values().forEach(monitor -> monitor.stopAfterLastRoundOfMetricsEmission(emitter)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be preceded by a call to updateMonitors
?
Or is it okay as we are going down anyway?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out. I think it may be needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updateMonitors() seems to be called in doMonitor which is called by monitor. monitorAndStop() may not have to call this method explicitly
} | ||
|
||
@Override | ||
public boolean monitor(ServiceEmitter emitter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this to continue having the same behaviour, every place that was calling stop
on this object must call the new method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last round of metrics emission seems to be relevant only in the cases of RealtimeMetricsMonitor and TaskRealtimeMetricsMonitor, where the new method is being called
server/src/main/java/org/apache/druid/segment/realtime/FireDepartmentMetrics.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/apache/druid/segment/realtime/FireDepartmentMetrics.java
Outdated
Show resolved
Hide resolved
@@ -359,6 +360,7 @@ public void onSuccess(Object result) | |||
if (numRemainingHandoffSegments.decrementAndGet() == 0) { | |||
List<DataSegment> segments = segmentsAndCommitMetadata.getSegments(); | |||
log.debug("Successfully handed off [%d] segments.", segments.size()); | |||
metrics.reportMaxSegmentHandoffTime(System.currentTimeMillis() - handoffStartTime); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also add a log line here when the handoff time crosses a certain threshold? The log entry should have the segment id too. Let's say that a cluster admin gets an alert for high segment handoff time. For debugging the alert, one might look at the task logs and spot the culprit segment. what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That does make sense, it would be nice to have the logs to clearly point to the problematic segments.
But I think the only option is logging this only when the numRemainingHandoffSegments
is 0. We would also need to call out in the log that possibly only "some" of the segment ids (out of the ones in the commit metadata) are problematic ones. This should be okay as the commit metadata should not be likely to have too many segments in one batch.
Tapping into the Object result
passed to the onSuccess
to decide exactly which segment comes after the threshold doesn't seem viable either as the Futures seem to return weird objects (I didn't dig very deep though).
Another concern is the exact value of the threshold itself, which would have to be hard-coded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abhishekagarwal87, as @kfaraz has pointed out, it seems difficult to get a good threshold since long coodinator times may affect the handoff period.
Does 15 or maybe 30 mins seem like a good threshold to avoid too many alerts on large clusters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AmatyaAvadhanula , we just need to log here for someone debugging an issue to find. We need not raise an alert here. That said, a 10 or 15 min threshold seems fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log has been added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments, otherwise looks good.
+1 after adding the required logs and build passing.
@@ -30,6 +30,14 @@ | |||
|
|||
void stop(); | |||
|
|||
|
|||
/** | |||
* Useful to push a last round of metrics before stopping the monitor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
* Useful to push a last round of metrics before stopping the monitor | |
* Emit a last round of metrics using the given emitter and then stop the monitor. |
Assert.assertFalse(monitor.isStarted()); | ||
boolean secondRound = monitor.monitor(emitter); | ||
boolean thirdRound = monitor.monitor(emitter); | ||
Assert.assertTrue(monitor.monitor(emitter)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much easier to read now!
|
||
long maxSegmentHandoffTime = metrics.maxSegmentHandoffTime(); | ||
if (maxSegmentHandoffTime >= 0) { | ||
emitter.emit(builder.build("ingest/handoff/time", maxSegmentHandoffTime)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should just keep the default as 0 rather than -1 and always emit the maxSegmentHandoffTime
, even if it is 0. This would also match the behaviour of the other metrics being emitted here, especially handoff/count
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this metric isn't of much value when segments aren't being handed off and emitting 0 wouldn't be very helpful.
@@ -332,6 +332,7 @@ public ListenableFuture<SegmentsAndCommitMetadata> registerHandoff(SegmentsAndCo | |||
} | |||
|
|||
log.debug("Register handoff of segments: [%s]", waitingSegmentIdList); | |||
long handoffStartTime = System.currentTimeMillis(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to clarify the intent of a final variable rather than relying on it being "effectively final"
long handoffStartTime = System.currentTimeMillis(); | |
final long handoffStartTime = System.currentTimeMillis(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Looks much simpler now, @AmatyaAvadhanula . |
Merged since build failure was due to coverage for trivial logging code |
Fixes Issue with metrics emission where the last round is not emitted.
Adds a metric segment/handoff/time to capture the total time taken for handoff for a given set of published segments.
This PR has: