
KAFKA-3715: add granular metrics per node #1446

Closed
wants to merge 62 commits

Conversation

aartigupta

Kafka Streams: add granular metrics per node, also expose ability to register non-latency metrics in StreamsMetrics

from #1362 (comment)
We can consider adding metrics for process / punctuate / commit rate at the granularity of each processor node in addition to the global rate mentioned above. This is very helpful in debugging.

We can consider adding rate / total cumulative metrics for context.forward indicating how many records were forwarded downstream from this processor node as well. This is helpful in debugging.

We can consider adding metrics for each stream partition's timestamp.
This is helpful in debugging.

Besides the latency metrics, we can also add throughput metrics in terms of source records consumed.

More discussions here https://issues.apache.org/jira/browse/KAFKA-3715
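As a rough illustration of the per-node rate metrics proposed above, here is a minimal, self-contained Java sketch. The class and method names are hypothetical, not the PR's actual StreamsMetrics API; the sensor-naming style follows the node-[node-id].metric-name convention discussed in this thread.

```java
// Hypothetical sketch of per-node rate metrics; NOT the actual StreamsMetrics API.
import java.util.HashMap;
import java.util.Map;

public class NodeMetricsSketch {
    private final Map<String, Long> counts = new HashMap<>();
    private final long startMs;

    public NodeMetricsSketch(long nowMs) {
        this.startMs = nowMs;
    }

    // Sensor name in a "node-[node-id].metric-name" style convention.
    public static String sensorName(String nodeId, String metric) {
        return "node-" + nodeId + "." + metric;
    }

    // Record one process / punctuate / forward event for a node.
    public void record(String nodeId, String metric) {
        counts.merge(sensorName(nodeId, metric), 1L, Long::sum);
    }

    // Events per second for one node's metric since construction.
    public double rate(String nodeId, String metric, long nowMs) {
        long n = counts.getOrDefault(sensorName(nodeId, metric), 0L);
        double elapsedSec = Math.max(1L, nowMs - startMs) / 1000.0;
        return n / elapsedSec;
    }
}
```

Keeping one counter per node, keyed by a hierarchical name, is what lets a reporter later split the global process/punctuate/commit rate into per-node rates.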

@aartigupta
Author

@guozhangwang, what do you think? I was able to run the examples and see the metrics per node in a JMX console.

@guozhangwang
Contributor

Thanks @aartigupta , @enothereska could you take a look first at this ticket? I have assigned you as the reviewer on the ticket, and please feel free to re-assign to me otherwise.


fetcherThread.shutdown()
}

-private def allMetricsNames = Metrics.defaultRegistry().allMetrics().asScala.keySet.map(_.getName)
+private def allMetricNames = Metrics.defaultRegistry().allMetrics().asScala.keySet.map(_.getName)
Contributor

These name changes are not strictly part of this fix; I'm wondering if we can open a MINOR PR for these while having this PR focus on streams only (to avoid confusion).

Author

Agreed, these were not intended for this fix; they managed to sneak their way in. My bad, fixed now.

@enothereska
Contributor

@aartigupta perhaps the PR name should be "KAFKA-3715: add granular metrics per node"? The JIRA number is usually part of the PR name. Minor thing but just for consistency.

@aartigupta aartigupta changed the title Kafka Streams: add granular metrics per node KAFKA-3715: add granular metrics per node Jun 2, 2016


public NodeMetricsImpl(StreamsMetrics metrics, String
name) {
Contributor

The line break here seems unnecessary.

@enothereska
Contributor

Thanks @aartigupta. Two higher-level questions: does it make sense to add a unit test or two for the new metrics? And do we have any overhead measurements of how much the new recordings add to the end-to-end latency?

@aartigupta
Author

aartigupta commented Jun 13, 2016

(screenshots attached: SimpleBenchmark with no state store / no changes, and SimpleBenchmark with per-node metrics)

Ran org.apache.kafka.streams.perf.SimpleBenchmark with the following configuration (i.e. without state-store-backed streams, and with simple print statements indicating which part of the benchmark is being run):

    System.out.println("producer");
    benchmark.produce();
    System.out.println("consumer");
    benchmark.consume();
    System.out.println("simple stream performance source->process");
    benchmark.processStream();
    System.out.println("simple stream performance source->sink");
    benchmark.processStreamWithSink();
    // simple stream performance source->store
    // benchmark.processStreamWithStateStore();

then attached a YourKit profiler and saw the following differences (see attached screenshots):

Without any changes to the code, CPU sampling in YourKit showed 61% CPU contention; with the per-node metrics, CPU sampling showed 70% CPU contention.

without any changes

org.apache.kafka.streams.perf.SimpleBenchmark
producer
Producer Performance [MB/sec write]: 8.853212193170378
consumer
[YourKit Java Profiler 2016.02-b38] Log file: /Users/aartikumargupta/.yjp/log/SimpleBenchmark-1754.log
Consumer Performance [MB/sec read]: 4.596191726854892
simple stream performance source->process
Streams Performance [MB/sec read]: 14.361679855964493
simple stream performance source->sink
Streams Performance [MB/sec read+write]: 4.535097423059803

with node metrics
Producer Performance [MB/sec write]: 5.035256582778346
consumer
[YourKit Java Profiler 2016.02-b38] Log file: /Users/aartikumargupta/.yjp/log/SimpleBenchmark-1549.log
Consumer Performance [MB/sec read]: 2.751484036579496
simple stream performance source->process
Streams Performance [MB/sec read]: 8.014018691588785
simple stream performance source->sink
Streams Performance [MB/sec read+write]: 6.562667414985077

Ran this multiple times and the results varied between 63% (no changes) and 72% (with per-node metrics). The difference seems to depend on the point at which the YourKit profiler is attached.
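The MB/sec figures quoted in these runs are throughput numbers; assuming the usual bytes-over-elapsed-time arithmetic (an assumption, not necessarily SimpleBenchmark's exact code), the computation looks like:

```java
// Assumed throughput arithmetic; SimpleBenchmark's actual code may differ.
public class ThroughputSketch {
    // MB/sec = (bytes / 2^20) / (elapsed ms / 1000)
    public static double mbPerSec(long bytes, long elapsedMs) {
        return (bytes / (1024.0 * 1024.0)) / (elapsedMs / 1000.0);
    }
}
```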

That said, I'm not sure if this is a valid load-simulating scenario.
@guozhangwang mentions in #1490 that

if your traffic is very small and the consumer is already at the log tail throughout your test, it will cause the polling / processing to be called with less batched data and hence further increased overhead.

@guozhangwang Is SimpleBenchmark a good scenario to profile?
If not, any suggestions for another scenario? Maybe we can add (check in) such a scenario under examples, to be used for all similar future profiling exercises.

Still working on the unit tests for the per-node metrics.

@gfodor
Contributor

gfodor commented Jun 21, 2016

hey @aartigupta it's kind of hard to tell based on your screenshots where the time is going since I don't see any drilldown into the call stacks of the StreamThread run loops. It's probably necessary for you to flip things on in the YourKit profiler so you can get the full call stacks and determine if Sensor.record is the source of most of the time.

@guozhangwang
Contributor

guozhangwang commented Jun 21, 2016

Thanks @aartigupta , some general comments:

  1. For naming consistency with other metrics objects, for finer-grained metrics we tend to name the sensors as "level-name.level-id.metrics-name"; for example, in SenderMetrics we used topic.[topic-name].records-per-batch etc. for per-topic metrics, in SelectorMetrics we used node-[node-id].bytes-sent etc. for per-node metrics, and in my latest PR KAFKA-3769: Create new sensors per-thread in KafkaStreams #1530 I was doing similar naming. You may already have noticed that this is for creating different sensors, since we synchronize on a per-sensor basis. Because the producer / consumer are always single-threaded, today we do not have any contention for the lock yet; in Streams we are trying to add per-thread metrics, and we will consider adding global metrics only after the synchronization is removed in KAFKA-3155, since, as we have discussed in other PRs, the contention overhead with multiple threads can be large.
  2. Different metrics reporters have the freedom to construct their reported metrics names from the hierarchy of "metrics-prefix, group-name, metrics-name, metrics-tags", where metrics-prefix is "kafka.producer" / "kafka.consumer" / "kafka.streams" depending on which client library you are using. In this case the sensor names are actually ignored, as they are used internally by the metrics object only for grouping different metrics. For example, in JmxReporter we create the mbeanName / attributeName as
mbean: "metrics-prefix": type="group-name", "tag1key"="tag1value", ..., "tagNkey"="tagNvalue"
    attribute1: "metrics-name1"
    attribute2: "metrics-name2"
    ...

So we need to make sure that the hierarchy is sufficient for different reporters to differentiate these metrics in their own space.
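To make the hierarchy concrete, here is a minimal sketch of how a reporter might assemble an mbean name from that hierarchy, in the spirit of JmxReporter; the group and tag names used below are illustrative, not the PR's actual names.

```java
// Illustrative mbean-name construction; group/tag names here are made up.
import java.util.Map;

public class MBeanNameSketch {
    // Builds "metrics-prefix:type=group-name,tag1key=tag1value,...,tagNkey=tagNvalue".
    public static String mbeanName(String prefix, String group, Map<String, String> tags) {
        StringBuilder sb = new StringBuilder(prefix).append(":type=").append(group);
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            sb.append(',').append(tag.getKey()).append('=').append(tag.getValue());
        }
        return sb.toString();
    }
}
```

With a per-node tag in the tag map, two nodes' metrics land under distinct mbeans even if their attribute (metrics-name) sets are identical, which is exactly the differentiation the hierarchy must provide.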

@guozhangwang
Contributor

Btw the SimpleBenchmark numbers are pretty low compared to those on my laptop (4GB memory and low-end CPUs). In what environment did you run the profiler?

@aartigupta
Author

@guozhangwang MacBook, 12-inch, early 2015 edition: 1.3GHz dual-core Intel Core M processor (Turbo Boost up to 2.9GHz) with 4MB shared L3 cache,
8GB of 1600MHz LPDDR3 onboard memory.
I think it has to do with attaching the YourKit profiler.
Without the profiler I get the following:

producer
Producer Performance [MB/sec write]: 22.247686586525987
consumer
Consumer Performance [MB/sec read]: 56.39283169836138
simple stream performance source->process
Streams Performance [MB/sec read]: 40.33237957119899
simple stream performance source->sink
Streams Performance [MB/sec read+write]: 18.71113212350438

Process finished with exit code 0

@theduderog
Contributor

Is there a way to register user-defined metrics?

@enothereska
Contributor

@aartigupta would you still have time for this PR or should I have a look? Thanks.

@asfbot

asfbot commented Jan 10, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/685/
Test FAILed (JDK 7 and Scala 2.10).

@asfbot

asfbot commented Jan 10, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/695/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot

asfbot commented Jan 10, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/697/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot

asfbot commented Jan 10, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/695/
Test FAILed (JDK 7 and Scala 2.10).

@asfbot

asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/721/
Test PASSed (JDK 7 and Scala 2.10).

@asfbot

asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/721/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot

asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/723/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot

asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/729/
Test FAILed (JDK 8 and Scala 2.11).

@asfbot

asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/727/
Test FAILed (JDK 8 and Scala 2.12).

@asfbot

asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/727/
Test FAILed (JDK 7 and Scala 2.10).

@asfbot

asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/730/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot

asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/728/
Test PASSed (JDK 8 and Scala 2.12).

@asfbot

asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/728/
Test PASSed (JDK 7 and Scala 2.10).

* Record a value with this sensor
* @param value The value to record
* @throws QuotaViolationException if recording this value moves a metric beyond its configured maximum or minimum
* Record a name with this sensor
Contributor

Is this intentional? Ditto below.

@asfbot

asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/745/
Test PASSed (JDK 8 and Scala 2.11).

@asfbot

asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk7-scala2.10/743/
Test FAILed (JDK 7 and Scala 2.10).

@enothereska
Contributor

unrelated: kafka.api.SslProducerSendTest.testCloseWithZeroTimeoutFromSenderThread

@enothereska
Contributor

The old org.apache.kafka.streams.integration.ResetIntegrationTest failure is back, but it shouldn't be related to this PR.

@asfgit asfgit closed this in e43cf22 Jan 11, 2017
@guozhangwang
Contributor

Merged to trunk. Many thanks to @aartigupta and @enothereska !!

asfgit pushed a commit that referenced this pull request Jan 11, 2017
… logging levels to Metrics

Kafka Streams: add granular metrics per node and per task, also expose ability to register non-latency metrics in StreamsMetrics
Also added different recording levels to Metrics.

This is a joint contribution from Eno Thereska and Aarti Gupta.

from #1362 (comment)
We can consider adding metrics for process / punctuate / commit rate at the granularity of each processor node in addition to the global rate mentioned above. This is very helpful in debugging.

We can consider adding rate / total cumulative metrics for context.forward indicating how many records were forwarded downstream from this processor node as well. This is helpful in debugging.

We can consider adding metrics for each stream partition timestamp.
This is helpful in debugging.
Besides the latency metrics, we can also add throughput metrics in terms of source records consumed.

More discussions here https://issues.apache.org/jira/browse/KAFKA-3715, KIP-104, KIP-105

Author: Eno Thereska <eno@confluent.io>
Author: Aarti Gupta <aartiguptaa@gmail.com>

Reviewers: Greg Fodor, Ismael Juma, Damian Guy, Guozhang Wang

Closes #1446 from aartigupta/trunk
@asfbot

asfbot commented Jan 11, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-jdk8-scala2.12/743/
Test FAILed (JDK 8 and Scala 2.12).

soenkeliebau pushed a commit to soenkeliebau/kafka that referenced this pull request Feb 7, 2017
efeg pushed a commit to efeg/kafka that referenced this pull request May 29, 2024