feat(replacer): expose --max-poll-interval-ms and --health-check-file into the CLI#7961

Open
aldy505 wants to merge 3 commits into getsentry:master from aldy505:feat/replacer-arroyo-flags

Conversation

@aldy505
Collaborator

@aldy505 aldy505 commented May 23, 2026

Currently there is no way in self-hosted deployments to configure many of the Snuba replacer's settings. This PR adds the --max-poll-interval-ms and --health-check-file options.

@aldy505 aldy505 requested a review from a team as a code owner May 23, 2026 08:11
Comment thread snuba/cli/replacer.py
Comment on lines +113 to +117
    consumer_config = {
        "max.poll.interval.ms": max_poll_interval_ms,
    }
    if max_poll_interval_ms < 45000:
        consumer_config["session.timeout.ms"] = max_poll_interval_ms
Contributor

Bug: The default value for --max-poll-interval-ms unconditionally overrides the Kafka consumer's max.poll.interval.ms, reducing it from 5 minutes to 30 seconds for all deployments.
Severity: HIGH

Suggested Fix

Change the default value of the max_poll_interval_ms argument to None. Only apply the max.poll.interval.ms setting to the Kafka consumer configuration if the value is not None, following the existing pattern in snuba/consumers/consumer_builder.py.
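A minimal sketch of that suggestion, assuming a hypothetical helper named `build_consumer_overrides` (it is not part of Snuba, and the real CLI wiring may look different). It only shows the "apply the setting when a value was given" pattern:

```python
from typing import Dict, Optional


def build_consumer_overrides(
    max_poll_interval_ms: Optional[int] = None,
) -> Dict[str, int]:
    """Return only the Kafka settings the operator explicitly asked for.

    Hypothetical helper for illustration. A value of None means the flag
    was not passed, so the client library's own default stays in effect.
    """
    overrides: Dict[str, int] = {}
    if max_poll_interval_ms is not None:
        overrides["max.poll.interval.ms"] = max_poll_interval_ms
    return overrides
```

With this shape, deployments that never pass the flag get an empty override dictionary and keep their current behavior.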

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: snuba/cli/replacer.py#L113-L117

Potential issue: The `replacer` CLI unconditionally applies a default
`max.poll.interval.ms` of 30000ms (30 seconds) to the Kafka consumer configuration. This
overrides `librdkafka`'s default of 300000ms (5 minutes) for all existing deployments
that do not specify the new `--max-poll-interval-ms` flag. Because `session.timeout.ms`
must be less than or equal to this value, it is also reduced. If a replacement operation
takes longer than 30 seconds, which is common for large replacements, the consumer will
be removed from its consumer group, leading to rebalancing loops in production.

Comment thread snuba/cli/replacer.py
Comment on lines +116 to +117
    if max_poll_interval_ms < 45000:
        consumer_config["session.timeout.ms"] = max_poll_interval_ms
Contributor

Bug: Setting session.timeout.ms equal to max_poll_interval_ms is risky, as it leaves no buffer for processing delays and can cause frequent, unnecessary consumer rebalances.
Severity: MEDIUM

Suggested Fix

Decouple session.timeout.ms from max_poll_interval_ms. Set session.timeout.ms to a static, reasonable value (e.g., the default 45000ms) that is independent of max_poll_interval_ms, while still ensuring the constraint session.timeout.ms <= max_poll_interval_ms is met. This provides a buffer for processing and prevents instability from transient delays.
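One possible reading of this suggestion, sketched under assumptions: `resolve_timeouts` and `SESSION_TIMEOUT_MS` are hypothetical names, and rejecting a too-small poll interval is an illustrative design choice rather than something the PR or Snuba does. The session timeout stays at the 45000 ms default mentioned above, and the constraint is enforced by validation instead of by lowering the session timeout:

```python
from typing import Dict

# Static session timeout, taken from the 45000 ms default cited in this review.
SESSION_TIMEOUT_MS = 45000


def resolve_timeouts(max_poll_interval_ms: int) -> Dict[str, int]:
    """Keep session.timeout.ms fixed and validate the poll interval against it.

    Hypothetical helper for illustration. It refuses values that would
    violate session.timeout.ms <= max.poll.interval.ms instead of silently
    shrinking the session timeout to match.
    """
    if max_poll_interval_ms < SESSION_TIMEOUT_MS:
        raise ValueError(
            "max.poll.interval.ms must be at least "
            f"{SESSION_TIMEOUT_MS} ms, got {max_poll_interval_ms}"
        )
    return {
        "max.poll.interval.ms": max_poll_interval_ms,
        "session.timeout.ms": SESSION_TIMEOUT_MS,
    }
```

The trade-off is that operators can no longer choose a poll interval below 45 seconds; whether that is acceptable depends on how the flag is meant to be used.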

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: snuba/cli/replacer.py#L116-L117

Potential issue: The code sets the Kafka consumer's `session.timeout.ms` to be equal to
`max_poll_interval_ms` when the latter is below 45 seconds. This conflates two distinct
timeouts: session liveness and message processing time. By making them equal (e.g., at
the default of 30 seconds), there is no tolerance for transient delays like GC pauses,
network jitter, or I/O latency. This can cause the Kafka broker to prematurely consider
the consumer dead, triggering frequent and unnecessary group rebalances, which degrades
performance and stability. This pattern is explicitly marked as a "HACK" in another part
of the codebase.

Also affects:

  • snuba/consumers/consumer_builder.py:175-176


2 participants