Add segment handoff time metric #13238

AmatyaAvadhanula · 2022-10-18T12:45:43Z

Fixes an issue with metrics emission where the last round of metrics is not emitted.

Adds a metric, ingest/handoff/time, to capture the total time taken for handoff for a given set of published segments.

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
a release note entry in the PR description.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added or updated version, license, or notice information in licenses.yaml
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
added integration tests.
been tested in a test Druid cluster.

kfaraz

Thanks for the changes, @AmatyaAvadhanula.
I have left some comments.

core/src/main/java/org/apache/druid/java/util/metrics/Monitor.java

...ice/src/test/java/org/apache/druid/indexing/common/stats/TaskRealtimeMetricsMonitorTest.java

server/src/test/java/org/apache/druid/segment/realtime/RealtimeMetricsMonitorTest.java

kfaraz · 2022-10-19T02:01:40Z

core/src/main/java/org/apache/druid/java/util/metrics/ParametrizedUriEmitterMonitor.java

+  @Override
+  public void stopAfterLastRoundOfMetricsEmission(ServiceEmitter emitter)
+  {
+    monitors.values().forEach(monitor -> monitor.stopAfterLastRoundOfMetricsEmission(emitter));


Should this be preceded by a call to updateMonitors?
Or is it okay as we are going down anyway?

Thanks for pointing this out. I think it may be needed.

updateMonitors() is already called in doMonitor(), which monitor() calls, so monitorAndStop() may not need to call it explicitly.
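To make the call chain in the reply above concrete, here is a minimal sketch. SketchEmitter, SketchMonitor and CompositeSketchMonitor are hypothetical stand-ins, not Druid's actual interfaces, and the default-method shape of monitorAndStop() is an assumption. It only illustrates that a final monitor() round already passes through updateMonitors() before the monitor stops.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an emitter; not Druid's ServiceEmitter.
interface SketchEmitter {
  void emit(String metric, long value);
}

// Hypothetical stand-in for a monitor; not Druid's Monitor interface.
interface SketchMonitor {
  boolean monitor(SketchEmitter emitter);

  void stop();

  // Assumed shape: emit one last round of metrics, then stop.
  default void monitorAndStop(SketchEmitter emitter) {
    monitor(emitter);
    stop();
  }
}

// A composite that refreshes its children inside monitor(), so
// monitorAndStop() does not need a separate refresh call.
class CompositeSketchMonitor implements SketchMonitor {
  private final List<SketchMonitor> children = new ArrayList<>();
  int updateCalls = 0; // counts refreshes, exposed for illustration only

  void add(SketchMonitor child) {
    children.add(child);
  }

  private void updateMonitors() {
    updateCalls++; // a real implementation would re-sync its child monitors here
  }

  @Override
  public boolean monitor(SketchEmitter emitter) {
    updateMonitors();
    children.forEach(child -> child.monitor(emitter));
    return true;
  }

  @Override
  public void stop() {
    children.forEach(SketchMonitor::stop);
  }
}
```

Calling monitorAndStop() on the composite runs exactly one refresh, one emission round per child, and then stops every child.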

kfaraz · 2022-10-19T02:13:58Z

...service/src/main/java/org/apache/druid/indexing/common/stats/TaskRealtimeMetricsMonitor.java

-  }
-
-  @Override
-  public boolean monitor(ServiceEmitter emitter)


For this to continue having the same behaviour, every place that was calling stop on this object must call the new method.

The last round of metrics emission seems to be relevant only for RealtimeMetricsMonitor and TaskRealtimeMetricsMonitor, and the new method is called in both of those cases.

server/src/main/java/org/apache/druid/segment/realtime/FireDepartmentMetrics.java

abhishekagarwal87 · 2022-10-26T14:29:04Z

...r/src/main/java/org/apache/druid/segment/realtime/appenderator/StreamAppenderatorDriver.java

@@ -359,6 +360,7 @@ public void onSuccess(Object result)
                      if (numRemainingHandoffSegments.decrementAndGet() == 0) {
                        List<DataSegment> segments = segmentsAndCommitMetadata.getSegments();
                        log.debug("Successfully handed off [%d] segments.", segments.size());
+                        metrics.reportMaxSegmentHandoffTime(System.currentTimeMillis() - handoffStartTime);


Can you also add a log line here when the handoff time crosses a certain threshold? The log entry should include the segment ID too. Say a cluster admin gets an alert for high segment handoff time: to debug the alert, they might look at the task logs and spot the culprit segment. What do you think?

That does make sense, it would be nice to have the logs to clearly point to the problematic segments.

But I think the only option is to log this when numRemainingHandoffSegments reaches 0. The log would also need to call out that possibly only some of the segment IDs in the commit metadata are the problematic ones. This should be okay, as the commit metadata is unlikely to contain too many segments in one batch.

Tapping into the Object result passed to onSuccess to determine exactly which segment crossed the threshold doesn't seem viable either, as the Futures seem to return weird objects (I didn't dig very deep though).

Another concern is the exact value of the threshold itself, which would have to be hard-coded.

@abhishekagarwal87, as @kfaraz has pointed out, it seems difficult to pick a good threshold, since long coordinator times may affect the handoff period.
Does 15 or maybe 30 mins seem like a good threshold to avoid too many alerts on large clusters?

@AmatyaAvadhanula, we just need to log here so that someone debugging an issue can find it. We need not raise an alert here. That said, a 10 or 15 min threshold seems fine.

The log has been added.
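For readers following the thread, the approach discussed above could look roughly like the sketch below. HandoffTimer, its method names, the log wording and the 15-minute value are all assumptions made for illustration and are not taken from the merged code. The elapsed time is computed only when the last pending segment is handed off, and the warning lists every segment ID in the batch, since it is not known which ones were slow.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical sketch; not Druid's StreamAppenderatorDriver.
class HandoffTimer {
  // Assumed threshold, based on the "10 or 15 min" suggested in the review.
  static final long WARN_THRESHOLD_MILLIS = TimeUnit.MINUTES.toMillis(15);

  private final List<String> segmentIds;
  private final AtomicInteger remaining;
  private final long startMillis;
  private final Consumer<String> warnLog; // stand-in for a logger's warn method

  HandoffTimer(List<String> segmentIds, long startMillis, Consumer<String> warnLog) {
    this.segmentIds = segmentIds;
    this.remaining = new AtomicInteger(segmentIds.size());
    this.startMillis = startMillis;
    this.warnLog = warnLog;
  }

  // Called once per handed-off segment. Returns the total handoff time when
  // the last segment completes, or -1 while segments are still pending.
  long onSegmentHandedOff(long nowMillis) {
    if (remaining.decrementAndGet() != 0) {
      return -1L;
    }
    long elapsed = nowMillis - startMillis;
    if (elapsed > WARN_THRESHOLD_MILLIS) {
      warnLog.accept(String.format(
          "Slow handoff: %d ms for segments %s; only some of these may be slow",
          elapsed,
          segmentIds
      ));
    }
    return elapsed;
  }
}
```

Passing the clock value in as a parameter keeps the sketch deterministic; real code would presumably read System.currentTimeMillis() inside the callback, as the diff above does.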

kfaraz

Minor comments, otherwise looks good.
+1 after adding the required logs and build passing.

kfaraz · 2022-10-28T07:21:17Z

core/src/main/java/org/apache/druid/java/util/metrics/Monitor.java

@@ -30,6 +30,14 @@

  void stop();

+
+  /**
+   * Useful to push a last round of metrics before stopping the monitor


Nit:

Suggested change

-   * Useful to push a last round of metrics before stopping the monitor
+   * Emit a last round of metrics using the given emitter and then stop the monitor.

kfaraz · 2022-10-28T07:43:08Z

...ice/src/test/java/org/apache/druid/indexing/common/stats/TaskRealtimeMetricsMonitorTest.java

-    Assert.assertFalse(monitor.isStarted());
-    boolean secondRound = monitor.monitor(emitter);
-    boolean thirdRound = monitor.monitor(emitter);
+    Assert.assertTrue(monitor.monitor(emitter));


Much easier to read now!

kfaraz · 2022-10-28T07:49:38Z

server/src/main/java/org/apache/druid/segment/realtime/RealtimeMetricsMonitor.java

+
+      long maxSegmentHandoffTime = metrics.maxSegmentHandoffTime();
+      if (maxSegmentHandoffTime >= 0) {
+        emitter.emit(builder.build("ingest/handoff/time", maxSegmentHandoffTime));


I wonder if we should just keep the default as 0 rather than -1 and always emit the maxSegmentHandoffTime, even if it is 0. This would also match the behaviour of the other metrics being emitted here, especially handoff/count.

I think this metric isn't of much value when segments aren't being handed off and emitting 0 wouldn't be very helpful.
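The trade-off in this exchange can be sketched as follows. MaxHandoffTimeSketch is a hypothetical class, not FireDepartmentMetrics, and the use of an AtomicLong is an assumption. A sentinel of -1 lets the monitor tell "no handoff recorded" apart from a genuine 0 ms handoff and skip emission in the former case.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.ObjLongConsumer;

// Hypothetical sketch of the sentinel approach; not Druid's FireDepartmentMetrics.
class MaxHandoffTimeSketch {
  static final long NO_HANDOFF = -1L; // sentinel meaning "nothing recorded"

  private final AtomicLong maxMillis = new AtomicLong(NO_HANDOFF);

  // Keeps the largest handoff time reported so far.
  void reportMaxSegmentHandoffTime(long millis) {
    maxMillis.accumulateAndGet(millis, Math::max);
  }

  long maxSegmentHandoffTime() {
    return maxMillis.get();
  }

  // Emits the metric only if a handoff was recorded; returns whether it emitted.
  boolean emitIfPresent(ObjLongConsumer<String> emitter) {
    long max = maxSegmentHandoffTime();
    if (max >= 0) {
      emitter.accept("ingest/handoff/time", max);
      return true;
    }
    return false;
  }
}
```

With a default of 0 instead, the guard would disappear and the metric would be emitted on every round, which matches the reviewer's alternative.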

kfaraz · 2022-10-28T07:50:03Z

...r/src/main/java/org/apache/druid/segment/realtime/appenderator/StreamAppenderatorDriver.java

@@ -332,6 +332,7 @@ public ListenableFuture<SegmentsAndCommitMetadata> registerHandoff(SegmentsAndCo
      }

      log.debug("Register handoff of segments: [%s]", waitingSegmentIdList);
+      long handoffStartTime = System.currentTimeMillis();


Better to clarify the intent of a final variable rather than relying on it being "effectively final".

Suggested change

-      long handoffStartTime = System.currentTimeMillis();
+      final long handoffStartTime = System.currentTimeMillis();

kfaraz · 2022-10-29T06:14:49Z

Looks much simpler now, @AmatyaAvadhanula.

AmatyaAvadhanula · 2022-11-07T12:20:01Z

Merged, since the build failure was due only to coverage for trivial logging code.

Add segment handoff time metric

6b3b7ef

kfaraz reviewed Oct 19, 2022

kfaraz added the Area - Metrics/Event Emitting label Oct 19, 2022

AmatyaAvadhanula added 5 commits October 26, 2022 10:24

Merge remote-tracking branch 'upstream/master' into feature-segmentHa…

d942457

…ndoffTimeMetric

Refactor code and modify tests

8b436c0

Fix tests

db481b1

fix test

88e3c47

fix test

70fafd4

kfaraz closed this Oct 26, 2022

kfaraz reopened this Oct 26, 2022

abhishekagarwal87 reviewed Oct 26, 2022

kfaraz approved these changes Oct 28, 2022

AmatyaAvadhanula added 2 commits October 28, 2022 17:18

Merge remote-tracking branch 'upstream/master' into feature-segmentHa…

3056d13

…ndoffTimeMetric

Cleaner approach + test

0ea9433

AmatyaAvadhanula added 6 commits October 30, 2022 18:36

Merge remote-tracking branch 'upstream/master' into feature-segmentHa…

99fa768

…ndoffTimeMetric

Merge remote-tracking branch 'upstream/master' into feature-segmentHa…

5f13459

…ndoffTimeMetric

Remove monitors on scheduler stop

d00e561

Add warning log for slow handoff

1cb7a1d

Merge remote-tracking branch 'upstream/master' into feature-segmentHa…

8353aea

…ndoffTimeMetric

Remove monitor when scheduler stops

9f196d0

AmatyaAvadhanula merged commit 650840d into apache:master Nov 7, 2022

kfaraz added this to the 25.0 milestone Nov 22, 2022

This was referenced Dec 18, 2022

[Draft] 25.0.0 Release Notes #13592

Closed

Add SegmentAllocationQueue to batch allocation actions #13369

Merged
