Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-6504: Fix creation of a sensor to be specific to a metric group so it is not shared #4514

Merged
merged 4 commits into from
Feb 6, 2018

Conversation

rayokota
Copy link
Contributor

@rayokota rayokota commented Feb 2, 2018

This change ensures that when sensors are created, they are specific to a metric group. Previously the sensors were being shared between metric groups, causing incorrect metrics.

@rayokota
Copy link
Contributor Author

rayokota commented Feb 2, 2018

@rhauch and @ewencp , can you review this to see if it is ok?

@asfgit
Copy link

asfgit commented Feb 2, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-test-coverage/278/

Copy link
Contributor

@rhauch rhauch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! LGTM

@rhauch
Copy link
Contributor

rhauch commented Feb 2, 2018

@rayokota, we should probably backport this to all branches since 1.0.

Copy link
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. We have a test case which verifies some of the metrics, but apparently not this one. Maybe we can extend that test case?

@rayokota
Copy link
Contributor Author

rayokota commented Feb 2, 2018

@hachikuji , I added a unit test to check the "source-record-active-count" metric. However, the above bug does not cause this test to fail, because the metric name is fine but it is the sensor name that is not. Two metrics ("sink-record-active-count" and "source-record-active-count") were both sharing the sensor with the same name ("sink-record-active-count"). So to trigger the bug, I would need to write a test to invoke both the WorkerSinkTask and the WorkerSourceTask and show that the metrics "source-record-active-count" and "sink-record-active-count" were a combined sum of the both source and sink record active counts when they should not be. At least that is what I believe is going on based on my reading of the code. Do you still want such a test?

@@ -509,7 +509,7 @@ public SourceTaskMetricsGroup(ConnectorTaskId id, ConnectMetrics connectMetrics)
pollTime.add(metricGroup.metricName(registry.sourceRecordPollBatchTimeMax), new Max());
pollTime.add(metricGroup.metricName(registry.sourceRecordPollBatchTimeAvg), new Avg());

sourceRecordActiveCount = metricGroup.metrics().sensor("sink-record-active-count");
sourceRecordActiveCount = metricGroup.metrics().sensor("source-record-active-count");
Copy link
Contributor

@hachikuji hachikuji Feb 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reported bug might actually be the fact that we use metricGroup.metrics().sensor() instead of metricGroup.sensor(). Since we are not going through the MetricGroup, the sensor will not be unique across source tasks. That means when we update this sensor for one source task, we will update all the other source tasks as well. This seems to fit with what was reported in the JIRA. You can verify the bug by adding another metric group to the test case with a different taskId.

Looking over the code, it looks like WorkerSinkTask also suffers from the same problem. We should fix that as well and add test cases which verify sensor uniqueness. It would also be a good idea to add tests which verify that metrics and sensors are removed correctly for a group after closing the metric group.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, the sensor creation was wrong. I fixed the set up of sensors and added unit tests that verify that the sensors were not unique before the fix, but unique after the fix.

assertEquals(1900.0, metrics.currentMetricValueAsDouble(group.metricGroup(), "poll-batch-max-time-ms"), 0.001d);
assertEquals(1450.0, metrics.currentMetricValueAsDouble(group.metricGroup(), "poll-batch-avg-time-ms"), 0.001d);
assertEquals(33.333, metrics.currentMetricValueAsDouble(group.metricGroup(), "source-record-poll-rate"), 0.001d);
assertEquals(1000, metrics.currentMetricValueAsDouble(group.metricGroup(), "source-record-poll-total"), 0.001d);
assertEquals(3.3333, metrics.currentMetricValueAsDouble(group.metricGroup(), "source-record-write-rate"), 0.001d);
assertEquals(100, metrics.currentMetricValueAsDouble(group.metricGroup(), "source-record-write-total"), 0.001d);
assertEquals(900.0, metrics.currentMetricValueAsDouble(group.metricGroup(), "source-record-active-count"), 0.001d);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can close the first group here and verify that the sensors/metrics are no longer registered? A similar check for the sink would be good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I added code to close the first group

@asfgit
Copy link

asfgit commented Feb 6, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-test-coverage/358/

@asfgit
Copy link

asfgit commented Feb 6, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/kafka-pr-test-coverage/348/

Copy link
Contributor

@hachikuji hachikuji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the patch!

@hachikuji
Copy link
Contributor

@rayokota Before I merge, can you change the title and the description to match the updated patch?

@rayokota rayokota changed the title KAFKA-6504: Fix source task metric caused by copy-paste error KAFKA-6504: Fix creation of a sensor to be specific to a metric group so it is not shared Feb 6, 2018
@rayokota
Copy link
Contributor Author

rayokota commented Feb 6, 2018

Ok, changed the title and description. Thanks for the review!

@hachikuji hachikuji merged commit 7d70a42 into apache:trunk Feb 6, 2018
hachikuji pushed a commit that referenced this pull request Feb 6, 2018
…4514)

This change ensures that when sensors are created, they are unique to the metric group associated with the task that created them. Previously the sensors were being shared between task metric groups, causing incorrect metrics.

Reviewers: Randall Hauch <rhauch@gmail.com>, Jason Gustafson <jason@confluent.io>
hachikuji pushed a commit that referenced this pull request Feb 6, 2018
…4514)

This change ensures that when sensors are created, they are unique to the metric group associated with the task that created them. Previously the sensors were being shared between task metric groups, causing incorrect metrics.

Reviewers: Randall Hauch <rhauch@gmail.com>, Jason Gustafson <jason@confluent.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants