-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve SegmentMetadataCache visibility #14453
Improve SegmentMetadataCache visibility #14453
Conversation
@@ -403,6 +403,13 @@ public void start() throws InterruptedException | |||
@VisibleForTesting | |||
void refresh(final Set<SegmentId> segmentsToRefresh, final Set<String> dataSourcesToRebuild) throws IOException | |||
{ | |||
long startMillis = System.currentTimeMillis(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't recommend using currentTimeMillis
in any new code. I'd suggest using nanoTime
and then converting it to the needed duration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you recommend nanoTime
, @rohangarg ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since that is always monotonically increasing in a JVM's lifetime, I don't think currentTimeMillis
is, since it comes from the system time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, interesting. It does seem that currentTimeMillis
can sometimes run backwards!
I still feel currentTimeMillis
would be a better fit here as it is a cheaper call and should be (guaranteed to be?) monotonic at least when called successively from the same thread in the absence of clock adjustments.
But either of the two approaches works for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the monotonicity depends on the NTP updates (on cloud instances), so not sure doing from same thread affects it. Also regarding cheapness, I think both calls are cheap and infrequent enough to not become a bottleneck for performance consideration. Further, the performance of each is also platform dependent I think.
I felt that for elapsed durations, probably something closer to a timer is better than using clock times.
Anyways, I'm ok with using any method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback. nanoTime
seems like the better option for measuring elapsed duration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes, @findingrish .
I wonder if these names would be better aligned with existing Druid metrics:
segment/metadatacache/refresh/time
segment/metadatacache/refresh/count
@@ -403,6 +403,13 @@ public void start() throws InterruptedException | |||
@VisibleForTesting | |||
void refresh(final Set<SegmentId> segmentsToRefresh, final Set<String> dataSourcesToRebuild) throws IOException | |||
{ | |||
long startMillis = System.currentTimeMillis(); | |||
|
|||
emitter.emit(ServiceMetricEvent.builder().build( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to emit the dimension dataSource
with this metric. So we should probably emit it from refreshSegmentsForDatasource
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The time
metric could be per datasource too. The time duration is already being tracked and logged, we just need to emit it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we report total/datasource value for both the metrics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, just add the dimension dataSource
while emitting the metric. Aggregation is not required. It can be summed up wherever the metric is being collected.
Thanks for the suggestion. What does Would something like |
We don't really have any guidelines in place yet. But if you look at the docs, metrics typically fall into one of these buckets:
There are some which could have probably fit into the above, so I wouldn't advocate them. But these are:
The service name need not be a part of the metric (i.e. I suggested The above are my opinions to reduce complexity for a user. You can choose some other nomenclature if you can reason about it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. @findingrish , please update the metrics documentation too.
Edit: Please add a couple of tests to verify the emission of this metric.
|
||
emitter.emit(builder.build("segment/metadatacache/refresh/count", segments.size())); | ||
|
||
final long startNanos = System.nanoTime(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This should probably be the first line of the method, so that we account for all the operations here.
@@ -783,10 +789,13 @@ private Set<SegmentId> refreshSegmentsForDataSource(final String dataSource, fin | |||
yielder.close(); | |||
} | |||
|
|||
long refreshDurationMillis = (System.nanoTime() - startNanos) / 1_000_000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: for readability
long refreshDurationMillis = (System.nanoTime() - startNanos) / 1_000_000; | |
long refreshDurationMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos); |
docs/operations/metrics.md
Outdated
@@ -66,6 +66,9 @@ Metrics may have additional dimensions beyond those listed above. | |||
|`sqlQuery/bytes`|Number of bytes returned in the SQL query response.|`id`, `nativeQueryIds`, `dataSource`, `remoteAddress`, `success`, `engine`| | | |||
|`init/serverview/time`|Time taken to initialize the broker server view. Useful to detect if brokers are taking too long to start.||Depends on the number of segments.| | |||
|`init/metadatacache/time`|Time taken to initialize the broker segment metadata cache. Useful to detect if brokers are taking too long to start||Depends on the number of segments.| | |||
|`segment/metadatacache/refresh/count`|Number of segments to refresh in broker segment metadata cache.|`dataSource`| | |||
|`segment/metadatacache/refresh/count`|Number of segments to refresh in broker segment metadata cache.|`dataSource`| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate entry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java
Show resolved
Hide resolved
@Test | ||
public void testRefreshShouldEmitMetrics() throws InterruptedException | ||
{ | ||
ServiceEmitter mockEmitter = Mockito.mock(ServiceEmitter.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use instance of StubServiceEmitter
instead.
Assert.assertEquals(emittedEvents.get("segment/metadatacache/refresh/count") | ||
.getUserDims() | ||
.get(DruidMetrics.DATASOURCE), "some_datasource"); | ||
Assert.assertEquals(emittedEvents.get("segment/metadatacache/refresh/time") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using StubServiceEmitter.getMetricValues()
will simplify this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of verifying the exact value, I am verifying that the metrics are emitted with dataSource dimension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getMetricValues
can be used for that too because it allows you to pass a set of dimension filters. See some of the other tests where it is being used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I missed this. Anyways, have updated the code.
Map<String, Set<Object>> emittedEventsMap = | ||
emitter.getEvents() | ||
.stream() | ||
.map(ServiceMetricEvent.class::cast) | ||
.collect(Collectors.groupingBy( | ||
ServiceMetricEvent::getMetric, | ||
Collectors.mapping(sme -> sme.getUserDims().get(DruidMetrics.DATASOURCE), Collectors.toSet()) | ||
)); | ||
Assert.assertTrue(emittedEventsMap.containsKey("init/metadatacache/time")); | ||
Assert.assertTrue(emittedEventsMap.get("segment/metadatacache/refresh/count").size() > 0); | ||
Assert.assertTrue(emittedEventsMap.get("segment/metadatacache/refresh/time").size() > 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something like this:
Map<String, Set<Object>> emittedEventsMap = | |
emitter.getEvents() | |
.stream() | |
.map(ServiceMetricEvent.class::cast) | |
.collect(Collectors.groupingBy( | |
ServiceMetricEvent::getMetric, | |
Collectors.mapping(sme -> sme.getUserDims().get(DruidMetrics.DATASOURCE), Collectors.toSet()) | |
)); | |
Assert.assertTrue(emittedEventsMap.containsKey("init/metadatacache/time")); | |
Assert.assertTrue(emittedEventsMap.get("segment/metadatacache/refresh/count").size() > 0); | |
Assert.assertTrue(emittedEventsMap.get("segment/metadatacache/refresh/time").size() > 0); | |
emitter.verifyEmitted("init/metadatacache/time", 1); | |
emitter.verifyEmitted("segment/metadatacache/refresh/time", ImmutableMap.of(DruidMetrics.DATASOURCE, 'yourds', 1)); |
public void testRefreshShouldEmitMetrics() throws InterruptedException | ||
{ | ||
StubServiceEmitter emitter = new StubServiceEmitter("broker", "host"); | ||
buildSchemaMarkAndTableLatch(SEGMENT_CACHE_CONFIG_DEFAULT, emitter); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of this, can we just instantiate a SchemaMetadataCache
and call refresh
from this test method itself?
@@ -707,14 +709,19 @@ private long recomputeIsRealtime(ImmutableSet<DruidServerMetadata> servers) | |||
private Set<SegmentId> refreshSegmentsForDataSource(final String dataSource, final Set<SegmentId> segments) | |||
throws IOException | |||
{ | |||
final long startTimeNanos = System.nanoTime(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you've given this enough thought - probably Guava's StopWatch
is a good utility to use here. It is also used at a couple of more place and provides the elapsed and conversion methods as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can consider this change, incase you're going to make other changes to the PR too. no point in running CI for this :)
New metrics: - `segment/metadatacache/refresh/time`: time taken to refresh segments per datasource - `segment/metadatacache/refresh/count`: number of segments being refreshed per datasource
Add metric
segment/metadatacache/refresh/time
that measures the time to call refresh(segmentsToRefresh, dataSourcesToRebuild). Emitted once per refresh per datasource.Add metric
segment/metadatacache/refresh/count
that measures the number of segments passed to refresh. Emitted once per refresh per datasource.