Skip to content

changefeedccl: reset retry backoff on highwater advance#164933

Merged
trunk-io[bot] merged 2 commits intocockroachdb:masterfrom
aerfrei:fix-164695
Mar 6, 2026
Merged

changefeedccl: reset retry backoff on highwater advance#164933
trunk-io[bot] merged 2 commits intocockroachdb:masterfrom
aerfrei:fix-164695

Conversation

@aerfrei
Copy link
Contributor

@aerfrei aerfrei commented Mar 5, 2026

Previously, the changefeed retry backoff only reset after running
error-free for a configurable duration (default 10 minutes via
changefeed.retry_backoff_reset).

Now the retry backoff also resets whenever the changefeed's highwater
mark (resolved timestamp) advances between retries. This directly
measures forward progress, so transient errors during rolling restarts
no longer leave the backoff unnecessarily elevated when the changefeed
is otherwise healthy.

The time-based reset remains as a fallback.

Fixes: #164695
Release note (bug fix): Changefeed retry backoff now resets when
the changefeed's resolved timestamp (highwater mark) advances between
retries, in addition to the existing time-based reset
(changefeed.retry_backoff_reset). This prevents transient rolling
restarts from causing changefeeds to fall behind because of
excessive backoff.

@trunk-io
Copy link
Contributor

trunk-io bot commented Mar 5, 2026

😎 Merged successfully - details.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@aerfrei aerfrei force-pushed the fix-164695 branch 3 times, most recently from d0ea233 to 16754f1 Compare March 5, 2026 18:24
@aerfrei aerfrei marked this pull request as ready for review March 5, 2026 18:25
@aerfrei aerfrei requested a review from a team as a code owner March 5, 2026 18:25
@aerfrei aerfrei requested review from KeithCh and removed request for a team March 5, 2026 18:25
@aerfrei aerfrei force-pushed the fix-164695 branch 2 times, most recently from 4c07ade to df6219b Compare March 5, 2026 21:44
// BeforeCheckpoint, if set, is called at the start of
// checkpointJobProgress. Returning a non-nil error causes a retryable
// failure before the checkpoint is persisted.
BeforeCheckpoint func() error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kinda like the old name more cause it tells you what it does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was somewhat torn too, open to suggestions here. Specifically, let me know what you think about the context of the test I added. There, we're using this knob to count checkpoints (even though the first few times we don't raise an error), and the connection to checkpoints isn't clear from the name.

RaiseRetryableErrorOnCheckpoint is a bit wordy (and is still weird when we don't actually raise an error). The way it is now, though, it's not clear that the error will necessary be retryable, so there is that drawback.

Seems like it should be simple, but it's something of a hard needle to thread. Don't want to overthink it though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The knob's main purpose is to raise an error, which is why I having error mentioned. I'll say I like MaybeRaiseRetryableError the most, but I'll let you make the final call.

Copy link
Collaborator

@andyyang890 andyyang890 Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive by: Both the name and behavior for RaiseRetryableError have always bugged me because it's not at all descriptive where in the pipeline the retryable error is raised and I also don't like that the error you return for this hook isn't the one that actually gets returned to the changefeed (at that point you might as well just have the hook return a bool). I'm in favor of renaming (and I think naming it BeforeCheckpoint is fine), but if possible, I'd like it if we could also refactor the hook so that we just return the error instead of creating a different error that just says the hook name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andyyang890 ! I like that suggestion, Implemented that

aerfrei added 2 commits March 6, 2026 11:51
Rename the `RaiseRetryableError` testing knob to `BeforeCheckpoint` to
follow the existing `Before*` naming convention (e.g. `BeforeEmitRow`)
and to make it clear that the knob fires at the start of
`checkpointJobProgress`. The old name described what it did (raise a
retryable error) but not where it fired, making tests that relied on
this hook being at that time harder to read.

Also propagate the hook's returned error directly instead of wrapping it
in a new error with the hook name, so tests control the exact error the
changefeed sees.

Informs: cockroachdb#164695
Previously, the changefeed retry backoff only reset after running
error-free for a configurable duration (default 10 minutes via
`changefeed.retry_backoff_reset`).

Now the retry backoff also resets whenever the changefeed's highwater
mark (resolved timestamp) advances between retries. This directly
measures forward progress, so transient errors during rolling restarts
no longer leave the backoff unnecessarily elevated when the changefeed
is otherwise healthy.

The time-based reset remains as a fallback.

Fixes: cockroachdb#164695
Release note (bug fix): Changefeed retry backoff now resets when
the changefeed's resolved timestamp (highwater mark) advances between
retries, in addition to the existing time-based reset
(changefeed.retry_backoff_reset). This prevents transient rolling
restarts from causing changefeeds to fall behind because of
excessive backoff.
@aerfrei
Copy link
Contributor Author

aerfrei commented Mar 6, 2026

/trunk merge

TFTR!

@trunk-io trunk-io bot merged commit 7202cdf into cockroachdb:master Mar 6, 2026
28 checks passed
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 11, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 11, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 13, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 13, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 13, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 13, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 13, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 13, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 13, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 13, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 13, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 13, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 13, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 13, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 13, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 13, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
aerfrei added a commit to aerfrei/cockroach that referenced this pull request Mar 13, 2026
Put the retry backoff reset on highwater advance (from PR cockroachdb#164933)
behind a new cluster setting:
changefeed.reset_backoff_on_highwater_advance.enabled

The setting is enabled by default on master, preserving the current
behavior. This gating allows the change to be safely backported to
release branches where the setting can be disabled by default.

Informs: cockroachdb#164695
Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

changefeedccl: reset retry backoff on highwater advance

4 participants