KAFKA-9306: The consumer must close KafkaConsumerMetrics #7839

Merged (2 commits) on Dec 19, 2019

Contributor

@cmccabe cmccabe commented Dec 17, 2019

No description provided.

Contributor

@soondenana soondenana left a comment

Thanks Colin for finding this. Left one comment.

Contributor

@srpanwar-confluent srpanwar-confluent left a comment

Signing off from testing side of things. Verified that this change fixes the issue.

@raymondng
Contributor

LGTM. Thanks @cmccabe for the fix!

@cmccabe
Contributor Author

cmccabe commented Dec 18, 2019

I pared this back to a more minimalistic fix for now, so that we can get something that can address the memory leak we spotted in 2.4. This will be easy to backport. I think we should do a better fix later which makes Metrics#close work as expected, but let's save that for trunk.

@hachikuji
Contributor

Seems the main problem here is the lambda inside KafkaConsumerMetrics. Through this, we end up retaining the full Metrics object and everything reachable from it. I'm a bit puzzled though why only this lambda is leaking and not the ones inside ConsumerCoordinatorMetrics. Any clue about that?

@hachikuji
Contributor

I think I see what's going on. Both KafkaConsumerMetrics and SelectorMetrics are internally using the same mbean. When the JmxReporter gets closed during Metrics.close, this mbean is unregistered. After Metrics is closed, the selector is closed and this goes through the individual metric unregistration logic in SelectorMetrics.close(). Ultimately this goes through JmxReporter.metricRemoval, which has the following logic:

            KafkaMbean mbean = removeAttribute(metric, mBeanName);
            if (mbean != null) {
                if (mbean.metrics.isEmpty()) {
                    unregister(mbean);
                    mbeans.remove(mBeanName);
                } else
                    reregister(mbean);
            }

The reregister is the key. Once all the selector metrics have been unregistered, the mbean will still have the 4 metrics added from KafkaConsumerMetrics and it will remain registered after the consumer closes. Ironically, it would have worked if SelectorMetrics did not explicitly deregister all of its metrics. This also explains why the bug fix here works. Since we are explicitly removing the 4 metrics, the mbean will ultimately not be re-registered. A simpler fix would probably be to make sure that the Metrics object gets closed last.

This bug seems to be the result of general sloppiness in the metrics library and the JMX reporter. We should probably file a JIRA to revisit these APIs so that this kind of mistake is harder to make. We can probably also improve testing to check for dangling mbeans.
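
To make the reregister sequence described above easier to follow, here is a minimal toy model of it. It is a sketch, not Kafka's JmxReporter: the class `SharedMbeanDemo`, its methods, and the metric names are hypothetical, and it assumes that closing the reporter unregisters the mbean while keeping its bookkeeping, which is what the analysis above implies.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model only: a stand-in for the behavior described above, not Kafka's JmxReporter.
class SharedMbeanDemo {
    // mbean name -> metrics (attributes) the mbean currently holds
    private final Map<String, Set<String>> mbeans = new HashMap<>();
    // mbean names currently visible in the simulated MBean server
    private final Set<String> registered = new HashSet<>();

    void addMetric(String mbeanName, String metric) {
        mbeans.computeIfAbsent(mbeanName, k -> new HashSet<>()).add(metric);
        registered.add(mbeanName);
    }

    // Assumption: close unregisters every mbean but keeps the bookkeeping map.
    void closeReporter() {
        registered.clear();
    }

    // Same branch structure as the quoted metricRemoval logic.
    void metricRemoval(String mbeanName, String metric) {
        Set<String> attrs = mbeans.get(mbeanName);
        if (attrs == null) {
            return;
        }
        attrs.remove(metric);
        if (attrs.isEmpty()) {
            registered.remove(mbeanName);
            mbeans.remove(mbeanName);
        } else {
            registered.add(mbeanName); // the "reregister" path
        }
    }

    // Returns true if the shared mbean is still registered after shutdown.
    static boolean mbeanDangles(boolean removeConsumerMetrics) {
        String mbean = "kafka.consumer:type=consumer-metrics,client-id=demo"; // illustrative name
        String[] consumerMetrics = {"consumer-a", "consumer-b", "consumer-c", "consumer-d"};
        String[] selectorMetrics = {"selector-a", "selector-b"};

        SharedMbeanDemo reporter = new SharedMbeanDemo();
        for (String m : consumerMetrics) reporter.addMetric(mbean, m);
        for (String m : selectorMetrics) reporter.addMetric(mbean, m);

        reporter.closeReporter(); // the Metrics object is closed first
        if (removeConsumerMetrics) {
            // models the fix: the consumer removes its own metrics explicitly
            for (String m : consumerMetrics) reporter.metricRemoval(mbean, m);
        }
        // the selector closes afterwards and removes its metrics one by one
        for (String m : selectorMetrics) reporter.metricRemoval(mbean, m);
        return reporter.registered.contains(mbean);
    }

    public static void main(String[] args) {
        System.out.println("without explicit removal, mbean dangles: " + mbeanDangles(false));
        System.out.println("with explicit removal, mbean dangles: " + mbeanDangles(true));
    }
}
```

In this model the mbean survives only when the consumer's four metrics are left in place, which matches the explanation of why explicitly removing them fixes the leak.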

@cmccabe
Contributor Author

cmccabe commented Dec 18, 2019

@hachikuji : great analysis.

I'm thinking we get this current PR in for future dot releases, and apply #7851 to trunk.

Can I get an LGTM?

Contributor

@hachikuji hachikuji left a comment

I'm happy with the current patch. Can you add a test case? I think it would be straightforward to add an assertion to one of the test cases in KafkaConsumerTest which verifies that all mbeans have been unregistered.
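
One way such an assertion could look, using only the standard JMX API, is sketched below. This is not the test that was added to KafkaConsumerTest: the class `DanglingMbeanCheck`, the domain `example.leakcheck`, and the helper names are hypothetical, and the JDK's `javax.management.timer.Timer` is used purely as a ready-made standard MBean standing in for a consumer's metrics mbean.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.timer.Timer;

// Sketch of a "no dangling mbeans" check; all names here are hypothetical.
class DanglingMbeanCheck {
    // Returns every ObjectName still registered under the given JMX domain.
    static Set<ObjectName> remaining(String domain) throws JMException {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        return server.queryNames(new ObjectName(domain + ":*"), null);
    }

    // Registers a stand-in mbean, optionally unregisters it (a "clean close"),
    // and returns how many mbeans are left in the domain.
    static int leftoverCount(boolean cleanUp) {
        String domain = "example.leakcheck"; // hypothetical domain
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        try {
            ObjectName name =
                new ObjectName(domain + ":type=consumer-metrics,client-id=demo");
            server.registerMBean(new Timer(), name);
            if (cleanUp) {
                server.unregisterMBean(name);
            }
            int leftover = remaining(domain).size();
            if (server.isRegistered(name)) {
                server.unregisterMBean(name); // keep the demo repeatable
            }
            return leftover;
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("leftover after clean close: " + leftoverCount(true));
        System.out.println("leftover after leaky close: " + leftoverCount(false));
    }
}
```

In a real test the query would presumably target the consumer's own JMX domain after `close()` and assert that the returned set is empty.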

@cmccabe
Contributor Author

cmccabe commented Dec 18, 2019

added a test

Contributor

@hachikuji hachikuji left a comment

LGTM. Thanks for the fix.

@cmccabe cmccabe merged commit 7e36865 into apache:trunk Dec 19, 2019
asfgit pushed a commit that referenced this pull request Dec 19, 2019
Reviewers: Vikas Singh <vikas@confluent.io>, Jason Gustafson <jason@confluent.io>, Shailesh Panwar <spanwar@confluent.io>
(cherry picked from commit 7e36865)
@cmccabe
Contributor Author

cmccabe commented Dec 19, 2019

Backported to 2.4 in case we do a dot release on that branch.

cmccabe added a commit to confluentinc/kafka that referenced this pull request Dec 19, 2019
Reviewers: Vikas Singh <vikas@confluent.io>, Jason Gustafson <jason@confluent.io>, Shailesh Panwar <spanwar@confluent.io>
(cherry picked from commit 7e36865)
(cherry picked from commit eb20a0d)
qq619618919 pushed a commit to qq619618919/kafka that referenced this pull request May 12, 2020
Reviewers: Vikas Singh <vikas@confluent.io>, Jason Gustafson <jason@confluent.io>, Shailesh Panwar <spanwar@confluent.io>
(cherry picked from commit 7e36865)