
Fix memory leak while stopping an unknown partition #4669

Open · wants to merge 1 commit into master

Conversation

@emasab (Contributor) commented Apr 3, 2024

When stopping a partition that isn't known to the client, a reference to that partition was kept in its own fetchq, leading to a memory leak. Solved by purging the fetch and ops queues of the unknown partition when it's stopped.
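
For illustration, a minimal sketch of the shape of such a change, assuming librdkafka's internal rd_kafka_q_purge() helper and the rktp_fetchq/rktp_ops queues of rd_kafka_toppar_t; the function name below is hypothetical and the actual diff may differ:

```c
/* Illustrative sketch only, not the actual diff in this PR.
 * Assumes the internal rd_kafka_toppar_t fields rktp_fetchq and rktp_ops
 * and the rd_kafka_q_purge() helper; real names and call sites may differ. */
static void rd_kafka_toppar_unknown_stop_purge(rd_kafka_toppar_t *rktp) {
        rd_kafka_toppar_lock(rktp);

        /* Ops and fetched messages still queued for this partition hold
         * references to it; purging them lets the refcount drop to zero
         * so the toppar can be destroyed instead of leaking. */
        rd_kafka_q_purge(rktp->rktp_fetchq);
        rd_kafka_q_purge(rktp->rktp_ops);

        rd_kafka_toppar_unlock(rktp);
}
```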

@Quuxplusone (Contributor) commented Apr 4, 2024

In #4486 I wrote:

My employer has also tentatively bisected an issue to 8e20e1e. We see that if we have a groupconsumer, and the broker goes down (that is, we shut down the Kafka broker), and then comes back up, and then we try to shut down the groupconsumer, it simply hangs forever. This doesn't reproduce in 1.9.3rc2 or earlier, but it does reproduce in 2.0.0rc1 through 2.3.0. In fact, it appears not to reproduce in a83cadf but to start reproducing with 8e20e1e.

@emasab wrote:

@Quuxplusone in case it causes a hang, please try this fix, and if it's still happening, tell me how to reproduce it.

Okay, I just tried it. My employer's test case is red with librdkafka v2.3.0, green with librdkafka v2.3.0-plus-#4667, but unfortunately remains red with librdkafka v2.3.0-plus-#4669.

That's not to say that #4669 is useless or fails to fix a real bug; it just doesn't fix our bug.

OTOH I also don't claim that #4669 is bug-free. Its new code that takes two mutex locks at once is clearly at risk of deadlock. It would benefit from a code comment explaining (or linking to) the project's lock-ordering policy.
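
(To make that concern concrete, here is a generic, self-contained illustration, not librdkafka code: two threads that take the same two mutexes in opposite orders can deadlock, which is exactly what a documented lock order prevents.)

```c
/* Generic illustration (not librdkafka code): two threads taking the same
 * two mutexes in opposite orders may deadlock.  A fixed, documented lock
 * order removes the risk.  Compile with: cc -pthread deadlock.c */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *thread1(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock_a);   /* 1st: A */
        pthread_mutex_lock(&lock_b);   /* 2nd: B */
        pthread_mutex_unlock(&lock_b);
        pthread_mutex_unlock(&lock_a);
        return NULL;
}

static void *thread2(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock_b);   /* 1st: B -- opposite order: may deadlock */
        pthread_mutex_lock(&lock_a);   /* 2nd: A */
        pthread_mutex_unlock(&lock_a);
        pthread_mutex_unlock(&lock_b);
        return NULL;
}

int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_create(&t2, NULL, thread2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("done (did not deadlock this run)\n");
        return 0;
}
```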

Unfortunately I don't have a reproducer using only librdkafka code; we build several C++ layers on top of librdkafka.

@emasab (Contributor, Author) commented Apr 5, 2024

@Quuxplusone thanks for testing it. The code respects the lock ordering policy, that is:

  1. rk main lock
  2. rkb lock
  3. rkt lock
  4. rktp lock
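
As a sketch of how that order is applied, locks are acquired outermost-first and released in reverse. The helper names below follow librdkafka's internal lock macros but are approximate and illustrative only:

```c
/* Illustrative only: acquire locks outermost-first per the policy above,
 * release in reverse order.  Helper names approximate librdkafka's internal
 * macros; exact names may differ per call site. */
rd_kafka_rdlock(rk);             /* 1. rk main lock */
rd_kafka_broker_lock(rkb);       /* 2. rkb lock     */
rd_kafka_topic_rdlock(rkt);      /* 3. rkt lock     */
rd_kafka_toppar_lock(rktp);      /* 4. rktp lock    */

/* ... work on the partition ... */

rd_kafka_toppar_unlock(rktp);
rd_kafka_topic_rdunlock(rkt);
rd_kafka_broker_unlock(rkb);
rd_kafka_rdunlock(rk);
```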

Is your failing test doing something specific, like deleting a topic or using the cooperative-sticky assignor?
