
Add segment handoff time metric #13238

Conversation

AmatyaAvadhanula
Contributor

Fixes an issue with metrics emission where the last round of metrics is not emitted.

Adds a metric segment/handoff/time to capture the total time taken for handoff for a given set of published segments.
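
A minimal, self-contained sketch of the timing pattern this PR adds (the class name HandoffTimer is a simplified stand-in; reportMaxSegmentHandoffTime, the -1 default, and the "last remaining segment" check mirror the diff hunks quoted below, but the surrounding structure is an assumption, not the actual Druid code):

// Sketch only: simplified names, not the actual Druid classes touched by this PR.
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

class HandoffTimer
{
  // -1 means "no handoff batch has completed yet", so nothing would be emitted.
  private final AtomicLong maxSegmentHandoffTime = new AtomicLong(-1);
  private final AtomicInteger numRemainingHandoffSegments;
  private final long handoffStartTime;

  HandoffTimer(int numSegments)
  {
    this.numRemainingHandoffSegments = new AtomicInteger(numSegments);
    this.handoffStartTime = System.currentTimeMillis();
  }

  // Called from the success callback of each segment's handoff future.
  void onSegmentHandedOff()
  {
    if (numRemainingHandoffSegments.decrementAndGet() == 0) {
      reportMaxSegmentHandoffTime(System.currentTimeMillis() - handoffStartTime);
    }
  }

  void reportMaxSegmentHandoffTime(long timeMillis)
  {
    maxSegmentHandoffTime.accumulateAndGet(timeMillis, Math::max);
  }

  long maxSegmentHandoffTime()
  {
    return maxSegmentHandoffTime.get();
  }
}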

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Contributor

@kfaraz kfaraz left a comment

Thanks for the changes, @AmatyaAvadhanula .
I have left some comments.

@Override
public void stopAfterLastRoundOfMetricsEmission(ServiceEmitter emitter)
{
  monitors.values().forEach(monitor -> monitor.stopAfterLastRoundOfMetricsEmission(emitter));
Contributor

Should this be preceded by a call to updateMonitors?
Or is it okay as we are going down anyway?

Contributor Author

Thanks for pointing this out. I think it may be needed.

Contributor Author

updateMonitors() seems to be called in doMonitor(), which is called by monitor(). monitorAndStop() may not have to call this method explicitly.
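
For context, a rough sketch of the call chain described in this reply (hypothetical structure for illustration; not the actual Druid monitor classes):

// Sketch of the described call chain: monitor() -> doMonitor() -> updateMonitors().
abstract class MonitorChainSketch
{
  // Invoked periodically by the monitor scheduler.
  public boolean monitor()
  {
    return doMonitor();
  }

  // One round of metrics emission; already refreshes the tracked monitors.
  protected boolean doMonitor()
  {
    updateMonitors();
    // ... emit metrics for each tracked monitor ...
    return true;
  }

  // Picks up newly added or removed per-task monitors.
  protected abstract void updateMonitors();
}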

}

@Override
public boolean monitor(ServiceEmitter emitter)
Contributor

For this to continue having the same behaviour, every place that was calling stop on this object must call the new method.

Contributor Author

The last round of metrics emission seems to be relevant only for RealtimeMetricsMonitor and TaskRealtimeMetricsMonitor, where the new method is being called.
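
To illustrate the call-site change being discussed (hypothetical caller and simplified types; only the method name stopAfterLastRoundOfMetricsEmission comes from the diff above):

// Hypothetical call site; not actual Druid code.
interface EmitterSketch
{
}

interface MonitorSketch
{
  void stop();

  void stopAfterLastRoundOfMetricsEmission(EmitterSketch emitter);
}

class TaskShutdownSketch
{
  void shutDown(MonitorSketch realtimeMetricsMonitor, EmitterSketch emitter)
  {
    // Before: stop() could drop the final partial round of metrics.
    // realtimeMetricsMonitor.stop();

    // After: emit one last round (e.g. the handoff time of the final batch), then stop.
    realtimeMetricsMonitor.stopAfterLastRoundOfMetricsEmission(emitter);
  }
}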

@kfaraz kfaraz closed this Oct 26, 2022
@kfaraz kfaraz reopened this Oct 26, 2022
@@ -359,6 +360,7 @@ public void onSuccess(Object result)
if (numRemainingHandoffSegments.decrementAndGet() == 0) {
  List<DataSegment> segments = segmentsAndCommitMetadata.getSegments();
  log.debug("Successfully handed off [%d] segments.", segments.size());
  metrics.reportMaxSegmentHandoffTime(System.currentTimeMillis() - handoffStartTime);
Contributor

Can you also add a log line here when the handoff time crosses a certain threshold? The log entry should have the segment id too. Let's say that a cluster admin gets an alert for high segment handoff time. For debugging the alert, one might look at the task logs and spot the culprit segment. What do you think?

Contributor

That does make sense, it would be nice to have the logs to clearly point to the problematic segments.

But I think the only option is to log this when numRemainingHandoffSegments reaches 0. We would also need to call out in the log that possibly only "some" of the segment ids (out of the ones in the commit metadata) are the problematic ones. This should be okay as the commit metadata is unlikely to have too many segments in one batch.

Tapping into the Object result passed to onSuccess to decide exactly which segment crosses the threshold doesn't seem viable either, as the Futures seem to return weird objects (I didn't dig very deep though).

Another concern is the exact value of the threshold itself, which would have to be hard-coded.
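
A sketch of what such a log might look like (the threshold value, message wording, and use of java.util.logging are assumptions for illustration, not the exact code merged in this PR):

// Illustrative only: logs the whole batch, since we can't tell which segment was slow.
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

class HandoffLogSketch
{
  private static final Logger LOG = Logger.getLogger(HandoffLogSketch.class.getName());
  private static final long HANDOFF_TIME_THRESHOLD_MILLIS = TimeUnit.MINUTES.toMillis(10);

  void onAllSegmentsHandedOff(long handoffTimeMillis, List<String> segmentIds)
  {
    if (handoffTimeMillis > HANDOFF_TIME_THRESHOLD_MILLIS) {
      // Possibly only some of these segments are the slow ones; the batch completes together.
      LOG.warning(String.format(
          "Handoff of segments %s took [%d] ms, crossing the threshold of [%d] ms.",
          segmentIds, handoffTimeMillis, HANDOFF_TIME_THRESHOLD_MILLIS
      ));
    }
  }
}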

Contributor Author

@abhishekagarwal87, as @kfaraz has pointed out, it seems difficult to pick a good threshold since long Coordinator times may affect the handoff period.
Does 15 or maybe 30 mins seem like a good threshold to avoid too many alerts on large clusters?

Contributor

@AmatyaAvadhanula, we just need a log here that someone debugging an issue can find; we need not raise an alert here. That said, a 10 or 15 min threshold seems fine.

Contributor Author

Log has been added.

Contributor

@kfaraz kfaraz left a comment

Minor comments, otherwise looks good.
+1 after adding the required logs and build passing.

@@ -30,6 +30,14 @@

void stop();


/**
* Useful to push a last round of metrics before stopping the monitor
Contributor

Nit:

Suggested change
* Useful to push a last round of metrics before stopping the monitor
* Emit a last round of metrics using the given emitter and then stop the monitor.

Assert.assertFalse(monitor.isStarted());
boolean secondRound = monitor.monitor(emitter);
boolean thirdRound = monitor.monitor(emitter);
Assert.assertTrue(monitor.monitor(emitter));
Contributor

Much easier to read now!


long maxSegmentHandoffTime = metrics.maxSegmentHandoffTime();
if (maxSegmentHandoffTime >= 0) {
  emitter.emit(builder.build("ingest/handoff/time", maxSegmentHandoffTime));
Contributor

I wonder if we should just keep the default as 0 rather than -1 and always emit the maxSegmentHandoffTime, even if it is 0. This would also match the behaviour of the other metrics being emitted here, especially handoff/count.

Contributor Author

I think this metric isn't of much value when segments aren't being handed off and emitting 0 wouldn't be very helpful.

@@ -332,6 +332,7 @@ public ListenableFuture<SegmentsAndCommitMetadata> registerHandoff(SegmentsAndCo
}

log.debug("Register handoff of segments: [%s]", waitingSegmentIdList);
long handoffStartTime = System.currentTimeMillis();
Contributor

Better to clarify the intent with a final variable rather than relying on it being "effectively final".

Suggested change
long handoffStartTime = System.currentTimeMillis();
final long handoffStartTime = System.currentTimeMillis();

Contributor Author

Done

@kfaraz
Contributor

kfaraz commented Oct 29, 2022

Looks much simpler now, @AmatyaAvadhanula .

@AmatyaAvadhanula AmatyaAvadhanula merged commit 650840d into apache:master Nov 7, 2022
@AmatyaAvadhanula
Contributor Author

Merged since the build failure was due to coverage for trivial logging code.
