util/retry: create dynamic retry mechanism#162104

Open
kev-cao wants to merge 1 commit into cockroachdb:master from kev-cao:util/dynamic-retry

Conversation

@kev-cao
Contributor

@kev-cao kev-cao commented Jan 30, 2026

This commit creates a dynamic retry mechanism wrapper around retry.Retry. This mechanism allows for multiple retry policies that the caller can switch between after each attempt.

Epic: None

Release note: None

@kev-cao kev-cao requested review from a team and andrew-r-thomas and removed request for a team January 30, 2026 19:17
@cockroach-teamcity
Member

This change is Reviewable

@kev-cao kev-cao requested a review from dt January 30, 2026 19:17
@kev-cao
Contributor Author

kev-cao commented Jan 30, 2026

To see this in action, check out:

const singletonState = "_"
r := retry.NewDynamicRetry[roachpb.RowCount](
	"restore", &execCtx.ExecCfg().Settings.SV,
).WithStates(map[string]retry.Options{
	singletonState: retryOpts,
}).WithStrategy(
	func(
		d *retry.DynamicRetryState, res roachpb.RowCount, err error,
	) retry.RetryTransition {
		if errors.HasType(err, &kvpb.InsufficientSpaceError{}) {
			return retry.FastFail(jobs.MarkPauseRequestError(errors.UnwrapAll(err)))
		}
		if shouldFastFailRestore(err) {
			return retry.FastFail(jobs.MarkAsPermanentJobError(err))
		}
		// If we are draining, it is unlikely we can start a
		// new DistSQL flow. Exit with a retryable error so
		// that another node can pick up the job.
		if execCtx.ExecCfg().JobRegistry.IsDraining() {
			return retry.FastFail(jobs.MarkAsRetryJobError(
				errors.Wrapf(err, "job encountered retryable error on draining node"),
			))
		}
		var currProgress jobspb.RestoreFrontierEntries = resumer.job.
			Progress().Details.(*jobspb.Progress_Restore).Restore.Checkpoint
		if !currProgress.Equal(prevPersistedSpans) {
			prevPersistedSpans = currProgress
			// If the previous persisted spans are different than the current, it
			// implies that further progress has been persisted.
			return retry.Reset()
		}
		if d.CurrentStateAttempt() >= maxRestoreRetryFastFail &&
			resumer.job.FractionCompleted() <= progThreshold {
			return retry.FastFail(jobs.MarkAsPermanentJobError(
				errors.Wrap(err, "restore job exhausted max retries without making progress"),
			))
		}
		return retry.Continue()
	},
).OnContinue(
	func(d *retry.DynamicRetryState, res roachpb.RowCount, err error) error {
		if logThrottler.ShouldProcess(crtime.NowMono()) {
			// We throttle the logging of errors to the jobs messages table to avoid
			// flooding the table during the hot loop of a retry.
			if err := execCtx.ExecCfg().InternalDB.Txn(ctx, func(ctx context.Context, txn isql.Txn) error {
				return resumer.job.Messages().Record(
					ctx, txn, "error", fmt.Sprintf("restore encountered error: %v", err),
				)
			}); err != nil {
				log.Dev.Warningf(ctx, "failed to record job error message: %v", err)
			}
		}
		testingKnobs := execCtx.ExecCfg().BackupRestoreTestingKnobs
		if testingKnobs != nil && testingKnobs.RunAfterRetryIteration != nil {
			if err := testingKnobs.RunAfterRetryIteration(err); err != nil {
				return err
			}
		}
		return nil
	},
)
res, err := r.DoCtx(
	ctx,
	func(ctx context.Context) (roachpb.RowCount, error) {
		return restore(
			ctx,
			execCtx,
			backupManifests,
			backupLocalityInfo,
			endTime,
			dataToRestore,
			resumer,
			encryption,
			kmsEnv,
		)
	},
	"_",
)
if err != nil {
	if r.LastTransition() == retry.TransitionFastFail {
		// Failure was a fast-fail, return the error as is.
		return res, err
	}
	// Since the restore was able to make some progress before exhausting the
	// retry counter, we will pause the job and allow the user to determine
	// whether or not to resume the job or discard all progress and cancel.
	return res, jobs.MarkPauseRequestError(errors.Wrap(err, "exhausted retries"))
}
return res, nil

This example doesn't use the multiple-states functionality, but it does take advantage of the ability to reset the retry mechanism or fast-fail.
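
For reference, here is a minimal sketch of what the multiple-states functionality might look like, assuming the same builder API as above (NewDynamicRetry, WithStates, WithStrategy, DoCtx) plus the NewState transition from the diff. The state names, backoff values, and the switching rule are purely illustrative, and ctx, sv (a settings values pointer), and doWork are assumed to be in scope:

// Hypothetical sketch: state names, options, and the switching rule are illustrative.
const (
	stateFast = "fast"
	stateSlow = "slow"
)
slowed := false
r := retry.NewDynamicRetry[int]("example", sv).WithStates(map[string]retry.Options{
	// Field names follow the existing util/retry.Options; the values are made up.
	stateFast: {InitialBackoff: 10 * time.Millisecond, MaxBackoff: time.Second},
	stateSlow: {InitialBackoff: 5 * time.Second, MaxBackoff: time.Minute},
}).WithStrategy(
	func(d *retry.DynamicRetryState, res int, err error) retry.RetryTransition {
		// After a few attempts under the fast policy, move to the slow policy
		// instead of giving up.
		if !slowed && d.CurrentStateAttempt() >= 3 {
			slowed = true
			return retry.NewState(stateSlow)
		}
		return retry.Continue()
	},
)
res, err := r.DoCtx(ctx, doWork, stateFast)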

The purpose of this mechanism is to facilitate the work for #159388, namely the approach outlined here.

Comment on lines 88 to 99
func NewState(state string) newState {
	return newState{state: state}
}
Contributor Author


This would be so much more succinct with Rust enums; we wouldn't need the separate TransitionKind or all this boilerplate with interfaces 😢
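
For readers without the diff open, the boilerplate being referred to is roughly the following shape. This is a hypothetical reconstruction: only NewState and newState are quoted above, and TransitionKind is mentioned only by name, so the remaining identifiers are guesses at the pattern rather than the PR's actual code. In Rust, all of this would collapse into a single enum whose variants carry their own data.

// Hypothetical reconstruction of the interface-based transition pattern.
type TransitionKind int

const (
	TransitionContinue TransitionKind = iota
	TransitionReset
	TransitionFastFail
	TransitionNewState
)

// RetryTransition is implemented by every transition variant.
type RetryTransition interface {
	Kind() TransitionKind
}

type newState struct{ state string }

func (newState) Kind() TransitionKind { return TransitionNewState }

// NewState returns a transition that moves the retry loop into the named state.
func NewState(state string) newState {
	return newState{state: state}
}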

@kev-cao kev-cao force-pushed the util/dynamic-retry branch 5 times, most recently from 8c1285e to 5d5ba0c Compare January 30, 2026 22:23
@github-actions

Potential Bug(s) Detected

The three-stage Claude Code analysis has identified potential bug(s) in this PR that may warrant investigation.

Next Steps:
Please review the detailed findings in the workflow run.

Note: When viewing the workflow output, scroll to the bottom to find the Final Analysis Summary.

After you review the findings, please tag the issue as follows:

  • If the detected issue is real or was helpful in any way, please tag the issue with O-AI-Review-Real-Issue-Found
  • If the detected issue was not helpful in any way, please tag the issue with O-AI-Review-Not-Helpful

@github-actions github-actions bot added the o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. label Jan 30, 2026
@kev-cao kev-cao force-pushed the util/dynamic-retry branch from 5d5ba0c to 9df0f8e Compare January 30, 2026 22:27
@kev-cao
Contributor Author

kev-cao commented Jan 30, 2026

This commit creates a dynamic retry mechanism wrapper around
`retry.Retry`. This mechanism allows for multiple retry policies that
the caller can switch between after each attempt.

Epic: None

Release note: None
@kev-cao kev-cao force-pushed the util/dynamic-retry branch from 9df0f8e to a017b37 Compare January 30, 2026 22:47
@kev-cao kev-cao added O-AI-Review-Real-Issue-Found AI reviewer found real issue and removed o-AI-Review-Potential-Issue-Detected AI reviewer found potential issue. Never assign manually—auto-applied by GH action only. labels Feb 2, 2026