
KAFKA-6820: Refactor Stream Metrics #6498

Closed

Conversation


@guozhangwang commented Mar 25, 2019

  1. Keep each level's sensor creation and metrics registration / de-registration logic inside the corresponding XXMetrics class itself, making StreamsMetricsImpl a util-function-centric class that wraps the Metrics registry shared by all the XXMetrics classes (KAFKA-6819).

1.a Each XXMetrics object will keep track of all the sensors it has ever created, and provide a clear function which must be called when the corresponding object (a thread, a task, etc.) is closed (see the sketch below this list).

1.b Anywhere in the code base, as long as a class can access the StreamsMetricsImpl object, it can use the util functions to get the sensor it wants to record on from the metrics registry, following the naming protocol. Only when it cannot access StreamsMetricsImpl does the sensor need to be created up front and passed in through the constructor.

1.c Decouple the original thread-level sensors into StreamThreadMetrics, extracted from the StreamsMetricsImpl class -- hence the latter can become a pure util-function provider that does not keep track of any sensors.

  2. Make metric / group / tag names constant strings, residing in each level's metrics class if they belong to specific sensors, or in StreamsMetricsImpl if they are util strings such as suffixes / prefixes.

  3. Change the public StreamsMetrics interface to be more intuitive for users (KAFKA-6820), and make all internal usages leverage the util functions provided by StreamsMetricsImpl as well.

  4. Remove default parent sensors everywhere.

  5. TEST: use getMetricByName and getMetricByNameAndTags from the utils class across all unit test classes.

  6. MINOR: because of 1), MockStreamsMetrics is no longer needed and can be replaced with StreamsMetricsImpl, since the latter is now a stateless util layer (tech debt KAFKA-5676).

NOTE: one thing missing here is the compatibility path with the old metrics, which will be handled in a follow-up PR.
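
To make 1.a and 1.b concrete, here is a minimal, hedged sketch of the ownership pattern (the class, sensor, and metric names below are illustrative only, not the actual classes introduced by this PR): each per-level metrics object registers its sensors against the shared Metrics registry and removes them again when the owning entity is closed.

```java
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Avg;
import org.apache.kafka.common.metrics.stats.Max;

import java.util.ArrayList;
import java.util.List;

// Hypothetical thread-level metrics class, following the pattern in 1.a / 1.c.
public class ExampleThreadMetrics {
    private final Metrics metrics;                               // shared registry (KAFKA-6819)
    private final List<String> ownedSensorNames = new ArrayList<>();

    public ExampleThreadMetrics(final Metrics metrics) {
        this.metrics = metrics;
    }

    // Creates (and remembers) a thread-level sensor; callers record latencies on it.
    public Sensor commitLatencySensor(final String threadName) {
        final String fullName = threadName + ".commit-latency";
        final Sensor sensor = metrics.sensor(fullName);
        sensor.add(metrics.metricName("commit-latency-avg", "stream-thread-metrics",
                "The average commit latency"), new Avg());
        sensor.add(metrics.metricName("commit-latency-max", "stream-thread-metrics",
                "The maximum commit latency"), new Max());
        ownedSensorNames.add(fullName);
        return sensor;
    }

    // The "clear function" of 1.a: must be called when the owning thread is closed.
    public void clear() {
        for (final String name : ownedSensorNames) {
            metrics.removeSensor(name);
        }
        ownedSensorNames.clear();
    }
}
```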

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@guozhangwang (Author):

@bbejeck @vvcephei @mjsax for a quick look.


@SuppressWarnings("unchecked")
@Override
public void init(final ProcessorContext context) {
super.init(context);
metrics = (StreamsMetricsImpl) context.metrics();

skippedRecordSensor = ((InternalProcessorContext) context).currentNode().nodeMetrics().skippedRecordsRateSensor();

@guozhangwang (Author):

We need to access the skipped-records sensor at the processor-node level now, and since the processor cannot access the processor-node that wraps it, we need to get it from InternalProcessorContext (ditto elsewhere). A minor impact is on unit tests that use a ProcessorContext which does not extend InternalProcessorContext -- I've only found one unit test that needed updates because of this.

@vvcephei:
👍

@@ -75,22 +74,21 @@ public void enableSendingOldValues() {
sendOldValues = true;
}

private class KStreamSessionWindowAggregateProcessor extends AbstractProcessor<K, V> {
public class KStreamSessionWindowAggregateProcessor extends AbstractProcessor<K, V> {

@guozhangwang (Author):
This class needs to be accessible in the instanceof condition for the lazy creation of sensors at NodeMetrics; ditto elsewhere.

@vvcephei:
Is this maybe left over from a previous iteration? I couldn't find the condition in NodeMetrics.

@guozhangwang (Author):
It's in ProcessorNodeMetrics.
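
For illustration only, a hedged sketch of why the processor class needs to be public: a node-level metrics class can lazily create certain sensors only when the wrapped processor is of a specific type, which requires an instanceof check against that class. The names below are placeholders, not the PR's actual ProcessorNodeMetrics code.

```java
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Rate;

public class ExampleNodeMetrics {
    // Stand-ins for the real processor types; the real check is against
    // KStreamSessionWindowAggregateProcessor, which is why that class became public.
    interface ExampleProcessor { }
    public static class ExampleWindowAggregateProcessor implements ExampleProcessor { }

    private final Metrics metrics;
    private Sensor lateRecordDropSensor;   // only needed for windowed aggregations

    public ExampleNodeMetrics(final Metrics metrics) {
        this.metrics = metrics;
    }

    // Lazily create window-specific sensors only for the processor types that use them.
    public void maybeCreateWindowSensors(final ExampleProcessor processor, final String nodeName) {
        if (processor instanceof ExampleWindowAggregateProcessor && lateRecordDropSensor == null) {
            lateRecordDropSensor = metrics.sensor(nodeName + ".late-record-drop");
            lateRecordDropSensor.add(
                metrics.metricName("late-record-drop-rate", "stream-processor-node-metrics",
                    "The average number of dropped late records per second"),
                new Rate());
        }
    }
}
```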

import static org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.PROCESSOR_NODE_ID_TAG;
import static org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.PROCESSOR_NODE_METRICS_GROUP;

public class Sensors {

@guozhangwang (Author):
We moved the sensor creation protocol into the corresponding NodeMetrics. Ditto for another Sensors class as well.

@vvcephei:
👍

@@ -69,7 +72,8 @@ public GlobalStateUpdateTask(final ProcessorTopology topology,
source,
deserializationExceptionHandler,
logContext,
processorContext.metrics().skippedRecordsSensor()
// task-id would be "-1_-1"
processorContext.metrics().taskLevelSensor(SKIPPED_RECORDS, processorContext.taskId().toString(), Sensor.RecordingLevel.INFO)

@guozhangwang (Author):
We need a task-id for the global-state-update-task for its skipped-records metrics.

if (processor != null) {
processor.init(context);
}
nodeMetrics.nodeCreationSensor.record(time.nanoseconds() - startNs);

@guozhangwang (Author):
This metric is removed as proposed. Ditto elsewhere.

final LogCaptureAppender appender = LogCaptureAppender.createAndRegister();
join.process(null, new Change<>("new", "old"));
LogCaptureAppender.unregister(appender);
try (final TopologyTestDriver driver = new TopologyTestDriver(builder.build(), props)) {

@guozhangwang (Author):
This change is needed to use InternalProcessorContext. Ditto below.

context.setCurrentNode(new ProcessorNode("testNode"));
final ProcessorNode processorNode = new ProcessorNode(TEST_NODE);
context.setCurrentNode(processorNode);
processorNode.init(context);

@guozhangwang (Author):
We need to initialize the processor node in order to create the NodeMetrics, so that it can be used for adding sensors later in the processor class; ditto in a few other places.

"The average number of occurrence of " + throughputOperation + " operation per second.",
metricTags)));

final JmxReporter reporter = new JmxReporter("kafka.streams");

@guozhangwang (Author):
I removed the part that tests the JmxReporter, since it should be covered in JmxReporterTest; removing it cleans up some overlapping coverage. Ditto elsewhere.

}

@Test
public void testMutiLevelSensorRemoval() {

@guozhangwang (Author):
We do not have by-default parent sensors now, and the rest of the special parent sensors are already covered in StreamMetricsIntegrationTest.

private final Properties props = StreamsTestUtils.getStreamsConfig();
private final StreamsConfig config = new StreamsConfig(props);
private final StreamsMetricsImpl streamsMetrics = new StreamsMetricsImpl(
new Metrics(new MetricConfig().recordLevel(Sensor.RecordingLevel.forName(config.getString(StreamsConfig.METRICS_RECORDING_LEVEL_CONFIG))))

@guozhangwang (Author):
We need to enable the DEBUG recording level to test some lower-level sensors now. Ditto elsewhere.
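
As a hedged illustration of the kind of setup change this implies in tests (property values and the helper's shape are assumptions, not the exact test code), the metrics registry has to be built with the DEBUG recording level, otherwise DEBUG-level sensors never record:

```java
import java.util.Properties;

import org.apache.kafka.common.metrics.MetricConfig;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.streams.StreamsConfig;

public class DebugLevelMetricsExample {

    // Builds a Metrics registry whose recording level follows the Streams config;
    // without metrics.recording.level=DEBUG, DEBUG-level sensors are not recorded.
    public static Metrics debugLevelMetrics() {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "test-app");          // assumed test values
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.METRICS_RECORDING_LEVEL_CONFIG, Sensor.RecordingLevel.DEBUG.name());

        final StreamsConfig config = new StreamsConfig(props);
        return new Metrics(new MetricConfig().recordLevel(
            Sensor.RecordingLevel.forName(config.getString(StreamsConfig.METRICS_RECORDING_LEVEL_CONFIG))));
    }
}
```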

@vvcephei left a comment:

Hey @guozhangwang ,

Thanks for tackling this! I made a quick pass and left some thoughts. Overall, I think this is a really good approach.

* Note that you can add more metrics to this sensor after created it, which can then be updated upon {@link Sensor#record(double)} calls;
* but additional user-customized metrics will not be managed by {@link StreamsMetrics}.
*
* @param scopeName name of the scope, which will be used as part of the metrics type, e.g.: "stream-[scope]-metrics".

@vvcephei:
Thanks for documenting how the arguments are used in the metrics.


public Sensor skippedRecordsRateSensor() {
if (skippedRecordsRateSensor == null) {
// keep the task-level parent sensor
final Sensor taskLevelSensor = metrics.taskLevelSensor(SKIPPED_RECORDS, taskName, Sensor.RecordingLevel.INFO);

@vvcephei:
Do we have a responsibility to remove this sensor on clear as well? Or does it get removed elsewhere? (I haven't looked)

@guozhangwang (Author):
Yes, it will be cleared at the task-level TaskMetrics.

More specifically, at this point the task sensor should always have been created already, and hence, to be on the safer side, we should do something like getButNotCreate APIs.
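
A hedged sketch of the getButNotCreate idea mentioned above (the helper name and its placement are assumptions, not this PR's actual API): the node-level lookup would fail fast if the task-level parent sensor has not already been registered by its owner, rather than silently creating it.

```java
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;

public final class SensorLookupExample {

    private SensorLookupExample() { }

    // Return an existing sensor by its full registered name, or fail; never create one here.
    public static Sensor getExistingSensorOrThrow(final Metrics metrics, final String fullSensorName) {
        final Sensor sensor = metrics.getSensor(fullSensorName);
        if (sensor == null) {
            throw new IllegalStateException(
                "Expected sensor '" + fullSensorName + "' to have been created by its owner already");
        }
        return sensor;
    }
}
```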

new CumulativeCount()
);
// we have to separate the latency-max/avg from the rate/total since the latter would be recorded at child node-level
processLatencySensor = metrics.taskLevelSensor(PROCESS_LATENCY, taskName, Sensor.RecordingLevel.DEBUG);

@vvcephei:
👍

@@ -361,7 +351,13 @@ public boolean process() {
log.trace("Start processing one record [{}]", record);

updateProcessorContext(record, currNode);
currNode.process(record.key(), record.value());

StreamsMetricsImpl.maybeMeasureLatency(

@vvcephei:
👍
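
For context, a minimal self-contained sketch of what a maybeMeasureLatency-style helper can look like (an assumed shape inferred from the call sites in this diff, not necessarily the PR's exact implementation): the action is timed only when the sensor would actually record.

```java
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.utils.Time;

public final class LatencyMeasurementExample {

    private LatencyMeasurementExample() { }

    // Run the action, recording its latency on the sensor only if recording is enabled.
    public static void maybeMeasureLatency(final Runnable action, final Time time, final Sensor sensor) {
        if (sensor.shouldRecord()) {
            final long startNs = time.nanoseconds();
            try {
                action.run();
            } finally {
                sensor.record(time.nanoseconds() - startNs);
            }
        } else {
            action.run();
        }
    }
}
```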

commitRequested = false;
},
time,
taskMetrics.commitLatencySensor);

@vvcephei:
This long lambda has me wondering if it would be nicer to make the lambda the last argument for readability.

import static org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.ID_SUFFIX;
import static org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.LATENCY_SUFFIX;

public class StoreMetrics {

@vvcephei:
👍

@guozhangwang changed the title from "[DO NOT MERGE] KAFKA-6820: Refactor Stream Metrics" to "KAFKA-6820: Refactor Stream Metrics" on Mar 29, 2019
@vvcephei:

Hey @guozhangwang , I was talking about time semantics in another thread with @mjsax , and we were thinking it might make sense to piggy-back one other thing on this.

Right now, we make a distinction between "dropped" and "skipped" records in the metrics. "Skipped" records are invalid records (e.g., null key, or a serialization error), whereas "dropped" records are valid records that we drop because they arrive after their window is closed, in windowed aggregations.

But this distinction is subtle. What we think matters to an operator is two questions: How many records got dropped/skipped, and for what reasons? Should we pick one of the terms and standardize on it, like:

  • late-record-skip
  • invalid-record-skip
  • null-key-record-skip
    etc.?

WDYT?

@guozhangwang (Author):

> But this distinction is subtle. What we think matters to an operator is two questions: How many records got dropped/skipped, and for what reasons? Should we pick one of the terms and standardize on it, like:
>
>   • late-record-skip
>   • invalid-record-skip
>   • null-key-record-skip
>     etc.?
>
> WDYT?

I had the same feeling about "dropped" vs. "skipped" when doing this refactoring effort. I agree that having consistent terminology is better, while we still make it doable for those who want finer-granularity information on the causes, etc.

There's a not-so-subtle difference between some of those scenarios, though: for example, on a deserialization error the record is dropped at the very beginning of the topology without being processed at all, whereas a null-key / late record may have traversed half-way into the topology and hence caused some side effects on the state before being dropped.

Hence, as you can see, the current skip metrics have recordings at the task level (the former case) as well as at the processor-node level (the latter case), whereas the drop metrics are at the processor-node level only.

I'm thinking maybe we can still distinguish the task-level and processor-node-level metrics in the names? I.e., have a set of metrics suffixed "skip" for the former (currently maybe only one metric, deser-error-skip, is needed) and another set of metrics suffixed "drop" for the latter.

Part of the reason is that another similar metric exists for windowed stores, where we have an expired-window-record-drop metric: its semantics are somewhat different, as it's based on the retention time of the store, which is orthogonal to the processor logic. However, if users want to know how many records were skipped in total (note that the processor node will still forward to downstream, etc.; it's just that its put calls into the store will be no-ops), they'll have to add those in as well, since they are not reflected in the previous metrics. So I feel having the same drop suffix can hint to users that they may want to add those up as well.

@vvcephei:

Thanks, @guozhangwang . Your reasoning sounds good to me. Of course, we should document this distinction alongside the metrics so that people can actually realize the benefit of being able to reason about drops/skips this way (and so I can remember it in 6 months, as well).

@guozhangwang (Author):

Closing this PR as it has been addressed by @cadonna

@guozhangwang deleted the K6820-refactor-thread-metrics branch on April 24, 2020.