Skip to content

fix(uffd): add retry with exponential backoff on source.Slice() error#2389

Merged
jakubno merged 4 commits intomainfrom
chore/add-retries-to-uffd
Apr 14, 2026
Merged

fix(uffd): add retry with exponential backoff on source.Slice() error#2389
jakubno merged 4 commits intomainfrom
chore/add-retries-to-uffd

Conversation

@jakubno
Copy link
Copy Markdown
Member

@jakubno jakubno commented Apr 14, 2026

Transient Slice errors (network blips, temporary GCS/S3 failures) previously caused immediate sandbox termination. Retry up to 3 times with exponential backoff (50ms-500ms + jitter) before signaling uffd exit, giving transient errors a chance to recover.

jakubno added 2 commits April 14, 2026 09:43
…s in faultPage

Transient Slice errors (network blips, temporary GCS/S3 failures) previously
caused immediate sandbox termination. Retry up to 3 times with exponential
backoff (50ms-500ms + jitter) before signaling uffd exit, giving transient
errors a chance to recover.
The constant represents the number of retries, not total attempts. Adjusted
the loop range and log fields to be consistent with the new naming.
@cursor
Copy link
Copy Markdown

cursor bot commented Apr 14, 2026

PR Summary

Medium Risk
Changes the page-fault data path to add timed retries, which can affect sandbox startup latency and failure behavior under load. The retry loop also introduces timers and jitter per fault, so mis-tuned backoff could amplify resource usage if many faults hit transient errors simultaneously.

Overview
Adds a retry loop around source.Slice() when faulting in pages via UFFD: on transient read errors it now retries up to 3 times with exponential backoff (50ms–500ms) plus jitter, and only then signals failure/exit, logging attempt counts and returning an error that includes the total attempts.

Reviewed by Cursor Bugbot for commit 5fa2cb0. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two issues found in the retry loop in userfaultfd.go. Line 376: time.After creates a timer not cancelled when ctx.Done fires, leaking it until expiry. Use time.NewTimer and call timer.Stop on context cancellation. Line 369: rand.Int63n panics when the argument is 0 - safe with current 50ms base delay but fragile if sliceRetryBaseDelay is reduced below 2ns. Guard with: if half := int64(delay) / 2; half > 0.

jakubno added 2 commits April 14, 2026 10:47
When the context is cancelled (e.g. during shutdown), source.Slice()
fails immediately with a context error. Without this check, up to 4096
concurrent fault handlers would each log a misleading 'retrying' warning
before the select detects ctx.Done(). Break the retry loop early when
ctx.Err() is non-nil to avoid the log noise.
time.After creates a timer that lives until expiry even if the select
takes the ctx.Done path. Under high failure load with 4096 concurrent
fault handlers each retrying 3 times, this could produce many abandoned
timers. Use time.NewTimer and call Stop() on context cancellation to
release the timer immediately.
@jakubno jakubno force-pushed the chore/add-retries-to-uffd branch from 1ec86ea to 5fa2cb0 Compare April 14, 2026 10:54
@jakubno jakubno requested review from arkamar and removed request for ValentaTomas and dobrac April 14, 2026 11:17
@jakubno jakubno merged commit 090c9e6 into main Apr 14, 2026
45 checks passed
@jakubno jakubno deleted the chore/add-retries-to-uffd branch April 14, 2026 12:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants