rebalance.timeout.ms support (KIP-62) #1039

Closed
jeffwidman opened this issue Feb 7, 2017 · 8 comments
Comments

@jeffwidman

jeffwidman commented Feb 7, 2017

Does librdkafka support heartbeats in a background thread? (KIP-62)

Trying to minimize risk of a spinning consumer group if a message unexpectedly takes too long to process.

This landed in Kafka 0.10.1.0 as it required a protocol change to pass the rebalance timeout around.

@edenhill
Contributor

edenhill commented Feb 7, 2017

Yes, librdkafka does all control plane stuff in the background and the application doesn't need to worry.
But the app should try to keep its per-message processing time under session.timeout.ms. Otherwise, if a rebalance happens while a message is being processed, another consumer might pick up the same message (depending on commit policy).

@jeffwidman
Author

> Yes, librdkafka does all control plane stuff in the background and the application does need to worry.

Did you mean "doesn't" need to worry?

> But the app should try to limit its per-message processing time under session.timeout.ms, otherwise if a rebalance happens while processing, another consumer might pick up the same message

Hmm... According to KIP-62, it looks like the rebalance timeout is actually the new limit for per-message processing time. session.timeout.ms can be set much lower in the new design because it's a background heartbeat for catching crashed consumers, and it's fine if per-message processing takes longer than session.timeout.ms.

Am I misreading the KIP?

@edenhill
Contributor

> Did you mean "doesn't" need to worry?

Yes :)

You are right about KIP-62. librdkafka performs heartbeats in the background, which solves the initial problem, but it does not yet support the KIP-62 protocol changes (the rebalance timeout / max processing time).
So people with long message processing times will still need to use a high, and therefore slow-to-react, session.timeout.ms.
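
To make the difference between the two timeouts concrete, here is a toy model in Python. It is not librdkafka code, and the function names are made up for illustration. It assumes that a consumer can only rejoin the group once it has finished the message it is currently processing, and that a broker receiving no rebalance timeout (JoinGroupRequest v0) falls back to the session timeout as the rejoin deadline, as KIP-62 describes.

```python
def rebalance_deadline_ms(session_timeout_ms, rebalance_timeout_ms=None):
    """How long the coordinator waits for a member to rejoin during a rebalance.

    Without KIP-62 (no rebalance timeout sent) the session timeout doubles
    as the rejoin deadline. With KIP-62 the two values are independent.
    """
    if rebalance_timeout_ms is None:
        return session_timeout_ms
    return rebalance_timeout_ms


def rejoins_in_time(processing_ms, session_timeout_ms, rebalance_timeout_ms=None):
    """Toy check: does a consumer busy for processing_ms rejoin before the deadline?

    Assumption: the consumer cannot rejoin until the current message is done.
    """
    deadline = rebalance_deadline_ms(session_timeout_ms, rebalance_timeout_ms)
    return processing_ms <= deadline


# A message that takes 30 s to process, with a 10 s session timeout:
without_kip62 = rejoins_in_time(30_000, 10_000)
with_kip62 = rejoins_in_time(30_000, 10_000, rebalance_timeout_ms=300_000)
```

In this model the consumer misses the deadline without KIP-62 unless session.timeout.ms is raised above the processing time, while with a separate rebalance timeout the session timeout can stay low.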

@jeffwidman
Author

Thanks for the update. Looking forward to when support for KIP-62 / rebalance timeout is added.

@edenhill edenhill changed the title Does librdkafka support heartbeats in a background thread? (KIP-62) rebalance.timeout.ms support (KIP-62) Mar 8, 2017
@edenhill edenhill added this to the 0.9.5 milestone Mar 8, 2017
@edenhill edenhill removed this from the next feature milestone May 18, 2017
@pablasso

pablasso commented Oct 2, 2017

@edenhill I'm interested in tackling the implementation of KIP-62 but it will be a bit of a challenge without context.

Could you give me some pointers on what/where needs to be changed? Any tips on how to test this would be greatly appreciated.

@edenhill
Contributor

edenhill commented Nov 9, 2017

Since librdkafka already has a background thread (or a bunch) that takes care of all the actual broker communication, including heartbeats, there are only a couple of things that need to be done to support KIP-62:

  • Add max.poll.interval.ms config property. Trivial.
  • Send rebalanceTimeoutMs in JoinGroupRequest v1. The value used is max.poll.interval.ms. Trivial.
  • Enforce max.poll.interval.ms. This is not as straightforward as in Java, which only has a single poll() call. librdkafka has multiple APIs to poll for messages (for different use cases) and they can be used simultaneously from different threads, so it is not really clear whether a max poll is a poll from any user thread, or from all of them. Also, for bindings like confluent-kafka-go that pull messages from librdkafka and put them in a buffered Go channel (where they may reside for some time without the app processing them), should we really use the time the messages were pulled from librdkafka, or the time the messages were handed to the application? (This is analogous to the auto offset store problem.)
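
The "any thread or all threads" question in the last item can be sketched as follows. This is a hypothetical Python helper, not librdkafka's implementation: the class name, the `policy` argument and the injectable clock are all assumptions made for illustration.

```python
import time


class PollWatchdog:
    """Tracks how long ago each caller last polled (hypothetical helper)."""

    def __init__(self, max_poll_interval_ms, clock=time.monotonic):
        self.max_poll_interval_ms = max_poll_interval_ms
        self._clock = clock      # injectable so the logic can be tested
        self._last_poll = {}     # caller id -> timestamp in seconds

    def record_poll(self, caller="default"):
        """Call when the application itself asks for messages."""
        self._last_poll[caller] = self._clock()

    def exceeded(self, policy="any"):
        """Return True if the consumer should be considered stalled.

        policy="any": one recent poll from any caller is enough.
        policy="all": every known caller must have polled recently.
        """
        if policy not in ("any", "all"):
            raise ValueError("policy must be 'any' or 'all'")
        if not self._last_poll:
            return False         # nothing has polled yet, nothing to enforce
        now = self._clock()
        ages_ms = [(now - t) * 1000.0 for t in self._last_poll.values()]
        # "any" looks at the freshest poll, "all" at the stalest one.
        age_ms = min(ages_ms) if policy == "any" else max(ages_ms)
        return age_ms > self.max_poll_interval_ms
```

The buffered-channel question maps onto where `record_poll()` is called: when the binding pulls a message out of the library, or when the application actually receives it. The sketch leaves that choice to the caller.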

@edenhill
Contributor

This is scheduled for v1.0.0

edenhill added a commit that referenced this issue Oct 13, 2018
Changed defaults:
 * session.timeout.ms = 10000
edenhill added a commit that referenced this issue Oct 22, 2018
Changed defaults:
 * session.timeout.ms = 10000
@edenhill
Contributor

Now on master
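
As a rough illustration of how the two timeouts could be split once max.poll.interval.ms is available, here is a configuration sketch written as a plain Python dict. The property names come from this thread; the group.id value and the 300000 ms poll interval are illustrative assumptions, not recommendations.

```python
# Consumer properties as a plain dict (no client library is imported here).
consumer_conf = {
    "group.id": "example-group",       # hypothetical group name
    "session.timeout.ms": 10000,       # low: only used to detect crashed consumers
    "max.poll.interval.ms": 300000,    # illustrative: upper bound on time between polls
}

# With the split, slow processing is bounded by the poll interval,
# not by the session timeout.
processing_budget_ms = consumer_conf["max.poll.interval.ms"]
```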


3 participants