New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HOTFIX: Fixes to metric names #3491
Conversation
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two meta-comments: for processor-node level metrics, we do not need the task-id prefix as in the attribute name:
And in store-cache-level metrics, we do not need the task-id as the prefix of the record-cache-id as well:
Since the processor-id (e.g. KSTREAM-AGGREGATE-0000000003
) and the store name (e.g. Count
) respectively are uniquely identifiable throughout the topology.
@@ -117,7 +118,7 @@ public Sensor addLatencyAndThroughputSensor(String scopeName, String entityName, | |||
|
|||
// first add the global operation metrics if not yet, with the global tags only | |||
Sensor parent = metrics.sensor(sensorName(operationName, null), recordingLevel); | |||
addLatencyMetrics(scopeName, parent, "all", operationName, tagMap); | |||
addLatencyMetrics(scopeName, parent, null, operationName, tagMap("all", "all")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. This will result in the mbean object name as, for example:
kafka.streams:type=stream-task-metrics,client-id=streams-wordcount-e600ab70-9a78-4886-b2c7-bce8f9fdbcf2-StreamThread-1,all=all
and
kafka.streams:type=stream-task-metrics,client-id=streams-wordcount-e600ab70-9a78-4886-b2c7-bce8f9fdbcf2-StreamThread-1,streams-task-id=1_0
I think it is better to be "streams-task-id=all".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another issue is that, for state-store-metrics, since the per store metrics do not have tags, doing this will result in a bit mis-match hierarchies like this:
Note the "all" attributes is grouped with the other stores at the same level while the state store names will be used as the prefix for the per-store metrics, so if we have multiple rocksDB stores it will be added all under the global "stream-rocksdb-state-metrics" as separate metrics with different store names:
Instead what we want is to have "store-name" in the tags as well so that under stream-[store-type]-state-metrics
we will have a first layer of "storeName1", "storeName2" .. and "all" (whose tags would be "store-name", "all").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
Also could you update the wed docs file for the streams metrics in this PR as well? |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
For All metrics have the |
docs/ops.html
Outdated
<td>The [average | maximum] commit time in ns for this task. </td> | ||
<td>commit-latency-avg</td> | ||
<td>The average commit time in ns for this task. </td> | ||
<td>kafka.streams:type=stream-task-metrics,streams-task-id=([-.\w]+)</td> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is incorrect: from I saw it is kafka.streams:type=stream-task-metrics,client-id=[thread-name],streams-task-id=[task-id]
.
Ditto for processor-node-level, and store-level metrics.
@@ -117,7 +118,7 @@ public Sensor addLatencyAndThroughputSensor(String scopeName, String entityName, | |||
|
|||
// first add the global operation metrics if not yet, with the global tags only | |||
Sensor parent = metrics.sensor(sensorName(operationName, null), recordingLevel); | |||
addLatencyMetrics(scopeName, parent, "all", operationName, tagMap); | |||
addLatencyMetrics(scopeName, parent, null, operationName, tagMap(tags != null && tags.length > 1 ? tags[0] : "all", "all")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these APIs are also exposed to users for them to construct their own metrics, in which case the tags
may not contain any values, but at the same time we are relying on entityName
to construct the parent-child metrics hierarchy. So we should handle the construction of the tags here than enforcing the callers to pass in null
as entity name and with the additional tags constructed inside the function.
@guozhangwang I don't think this is a metrics problem as such. I simply use (consistently) the name of the node or cache. It just so happens that our way of naming caches includes the task ID in them. By the time we reach the metrics code, I don't have any control on the cache name. |
@guozhangwang the source of the problems is the way we assign names to nodes, caches and stores. That itself is inconsistent. I can't solve it at the metrics level if it is inconsistent at a higher level:
|
Yes I understand that we use
Just to follow what I meant above: I think we do not necessarily enforce ourselves to use the id of the node / cache / .. as the tags of "xx-id=yy". In addition, that does not work well with the parent-child metrics hierarchy: originally we infer the hierarchy from the So what I had previously in mind is that, for the APIs like
3.1) For per-thread metrics, this is the top-level metrics and we are not using the public APIs but are calling
So we have one metric per-thread, and we do not have a parent metric over all threads (since in practice very likely one thread per instance). 3.2) For per-task metrics, this is the second layer metrics and we are using the public APIs. We can pass the parameters as
So we have one metric per-task-per-thread, and the parent metrics aggregates over all tasks per thread. 3.3) For per-processor-node metrics (note this is one lever lower than the per-task since our hierarchy is thread-tasks-processorNode/stateStore) similarly we are using the public APIs; BUT the processor node id itself is not global unique since multiple tasks have the same processor node. So we can pass the parameters as
I.e. we will have one metrics per-processor-per-task-per-thread, and the parent metric will be aggregate over all processor nodes per-task-per-thread. 3.4) / 3.5) For per-store / per-cache metrics, they are at the similar level of the per-processor-node metrics. Similarly we pass the
Note each of the child metric is per-store/cache-per-task-per-thread, and the parent aggregates over all the state types or caches per-task-per-thread.
I.e. one metric across threads / tasks per the entity, and one parent aggregated across all the entities of the same scope. If users want to have this metric per-task / thread as well, then we can state that in the javadocs of Do this whole make sense to you? |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
@guozhangwang I'll need to update docs, but see if code makes more sense first. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The resulted metrics lgtm. Left some comments.
When updating the web docs please also describe the hierarchies (thread -> task -> processor-node / state-stores / caches) for users.
@@ -94,15 +97,16 @@ private String sensorName(String operationName, String entityName) { | |||
} | |||
} | |||
|
|||
private Map<String, String> tagMap(String... tags) { | |||
public Map<String, String> tagMap(String... tags) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to expose this as a public function in order to be used in NamedCache.java
, we can make it as a static public function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It uses the default tags from the StreamsMetricsImpl
class so it is hard to make static.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, nvm.
List<String> updatedTagList = new ArrayList(Arrays.asList(tags)); | ||
updatedTagList.add(scopeName + "-id"); | ||
updatedTagList.add(entityName); | ||
Map<String, String> tagMap = tagMap(updatedTagList.toArray(new String[updatedTagList.size()])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified as
Map<String, String> tagMap = tagMap(scopeName + "-id", entityName);
Map<String, String> allTagMap = tagMap("all", entityName);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately I need to add the original tags as well, not just those two lines. I've cleaned up though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then how about:
Map<String, String> tagMap = tagMap(scopeName + "-id", entityName, tags);
Map<String, String> allTagMap = tagMap("all", entityName, tags);
?
List<String> updatedTagList = new ArrayList(Arrays.asList(tags)); | ||
updatedTagList.add(scopeName + "-id"); | ||
updatedTagList.add(entityName); | ||
Map<String, String> tagMap = tagMap(updatedTagList.toArray(new String[updatedTagList.size()])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto here.
addThroughputMetrics(scopeName, sensor, entityName, opName, tags); | ||
} | ||
|
||
private void addThroughputMetrics(String scopeName, Sensor sensor, String entityName, String opName, Map<String, String> tags) { | ||
maybeAddMetric(sensor, metrics.metricName(entityName + "-" + opName + "-rate", groupNameFromScope(scopeName), | ||
"The average number of occurrence of " + entityName + " " + opName + " operation per second.", tags), new Rate(new Count())); | ||
maybeAddMetric(sensor, metrics.metricName(opName + "-rate", groupNameFromScope(scopeName), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
entityName
parameter is not needed right?
maybeAddMetric(sensor, metrics.metricName(entityName + "-" + opName + "-latency-max", groupNameFromScope(scopeName), | ||
"The max latency of " + entityName + " " + opName + " operation.", tags), new Max()); | ||
|
||
maybeAddMetric(sensor, metrics.metricName(opName + "-latency-avg", groupNameFromScope(scopeName), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as addThroughputMetrics
, entityName
is not needed.
@@ -77,6 +77,28 @@ public long flushes() { | |||
} | |||
|
|||
/** | |||
* The thread cache maintains a set of caches whose names are a concatenation of the task ID and the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Javadoc: we can refer to {link NamedCache}
, and also we need to specify the purpose of the function itself, i.e. "return the name of the cache as..." instead of explaining the class.
this.cache = this.context.getCache(); | ||
this.cacheName = this.cache.nameSpaceFromTaskIdAndStore(context.taskId().toString(), underlying.name()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since nameSpaceFromTaskIdAndStore
is static function, we should use ThreadCache#nameSpaceFromTaskIdAndStore
.
final Sensor hitRatioSensor; | ||
|
||
private void maybeAddMetric(Sensor sensor, MetricName name, MeasurableStat stat) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function can be part of NamedCacheMetrics
itself.
// add parent | ||
Sensor parent = this.metrics.registry().sensor(opName, Sensor.RecordingLevel.DEBUG); | ||
maybeAddMetric(parent, this.metrics.registry().metricName(opName + "-avg", groupName, | ||
"The average cache hit ratio of " + name, allMetricTags), new Avg()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we do not need the of name
part anymore in the metric description, since it is already under the tags. Just The average cache hit ratio
is good enough. Ditto elsewhere.
final String opName = "hitRatio"; | ||
final String tagKey = "record-cache-id"; | ||
final String tagValue = name; | ||
final String tagValue = ThreadCache.underlyingStoreNamefromCacheName(name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to keep tagKey = scope + "-id";
in case we change scope
in the future.
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
@guozhangwang anything else left? Thanks. |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
A couple of fixes to metric names to match the KIP - Removed extra strings in the metric names that are already in the tags - add a separate metric for "all" Author: Eno Thereska <eno.thereska@gmail.com> Reviewers: Guozhang Wang <wangguoz@gmail.com> Closes #3491 from enothereska/hotfix-metric-names (cherry picked from commit 6bee1e9) Signed-off-by: Guozhang Wang <wangguoz@gmail.com>
LGTM. Merged to trunk and cherry-picked to 0.11.0. Thanks @enothereska . |
A couple of fixes to metric names to match the KIP