feat(replacer): expose --max-poll-interval-ms and --health-check-file into the CLI#7961
feat(replacer): expose --max-poll-interval-ms and --health-check-file into the CLI#7961aldy505 wants to merge 3 commits into
Conversation
| consumer_config = { | ||
| "max.poll.interval.ms": max_poll_interval_ms, | ||
| } | ||
| if max_poll_interval_ms < 45000: | ||
| consumer_config["session.timeout.ms"] = max_poll_interval_ms |
There was a problem hiding this comment.
Bug: The default value for --max-poll-interval-ms unconditionally overrides the Kafka consumer's max.poll.interval.ms, reducing it from 5 minutes to 30 seconds for all deployments.
Severity: HIGH
Suggested Fix
Change the default value of the max_poll_interval_ms argument to None. Only apply the max.poll.interval.ms setting to the Kafka consumer configuration if the value is not None, following the existing pattern in snuba/consumers/consumer_builder.py.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: snuba/cli/replacer.py#L113-L117
Potential issue: The `replacer` CLI unconditionally applies a default
`max.poll.interval.ms` of 30000ms (30 seconds) to the Kafka consumer configuration. This
overrides `librdkafka`'s default of 300000ms (5 minutes) for all existing deployments
that do not specify the new `--max-poll-interval-ms` flag. Because `session.timeout.ms`
must be less than or equal to this value, it is also reduced. If a replacement operation
takes longer than 30 seconds, which is common for large replacements, the consumer will
be removed from its consumer group, leading to rebalancing loops in production.
Did we get this right? 👍 / 👎 to inform future reviews.
| if max_poll_interval_ms < 45000: | ||
| consumer_config["session.timeout.ms"] = max_poll_interval_ms |
There was a problem hiding this comment.
Bug: Setting session.timeout.ms equal to max_poll_interval_ms is risky, as it leaves no buffer for processing delays and can cause frequent, unnecessary consumer rebalances.
Severity: MEDIUM
Suggested Fix
Decouple session.timeout.ms from max_poll_interval_ms. Set session.timeout.ms to a static, reasonable value (e.g., the default 45000ms) that is independent of max_poll_interval_ms, while still ensuring the constraint session.timeout.ms <= max_poll_interval_ms is met. This provides a buffer for processing and prevents instability from transient delays.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.
Location: snuba/cli/replacer.py#L116-L117
Potential issue: The code sets the Kafka consumer's `session.timeout.ms` to be equal to
`max_poll_interval_ms` when the latter is below 45 seconds. This conflates two distinct
timeouts: session liveness and message processing time. By making them equal (e.g., at
the default of 30 seconds), there is no tolerance for transient delays like GC pauses,
network jitter, or I/O latency. This can cause the Kafka broker to prematurely consider
the consumer dead, triggering frequent and unnecessary group rebalances, which degrades
performance and stability. This pattern is explicitly marked as a "HACK" in another part
of the codebase.
Also affects:
snuba/consumers/consumer_builder.py:175-176
Currently there are no way in self-hosted to be able to configure many things on
snuba replacer. This PR adds--max-poll-interval-msand--health-check-file.