fix(lfs): extend GitHub token refresh buffer + add LFS fetch timeout #322
Merged
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d42735e155
LFS snapshot jobs were observed failing with repeated 'Bad credentials' from the GitHub LFS batch API, with single jobs churning for 12-26 hours before giving up.

Root cause: a GitHub App installation token has a fixed 1 h server-side TTL, and the TokenManager cache served tokens until they had only 5 m of validity remaining. An LFS fetch that started near a cache boundary and ran longer than 5 m would exhaust its token mid-flight; git-lfs's internal batch retries then re-used the same expired token, producing a retry storm that ran until something else killed the subprocess.

Two coordinated changes:

- internal/githubapp/config.go: RefreshBuffer 5m -> 30m. Every token handed out now has at least 30 m of validity remaining.
- internal/gitclone/manager.go, internal/strategy/git/snapshot.go: new LFSFetchTimeout (default 25m) wrapping 'git lfs fetch'. Bounds the retry-storm runaway and keeps subprocess lifetime below RefreshBuffer so a baked-in token can't expire mid-fetch.

The 25m < 30m invariant is the airtight piece: the longest possible LFS subprocess (25 m) is shorter than the minimum remaining validity of any token it could be handed (30 m).

Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019e805f-48b7-7594-9d46-415a44d2e1c5
d42735e to
9575ac8
git-lfs spawns transfer helpers that inherit our stdout/stderr pipes, so killing only the top-level git process can leave CombinedOutput blocked indefinitely. Same pattern used by manager.go for clone/fetch.

Amp-Thread-ID: https://ampcode.com/threads/T-019e805f-48b7-7594-9d46-415a44d2e1c5
Co-authored-by: Amp <amp@ampcode.com>
Good catch — fixed in 1cb340a. Applied the same |
Amp-Thread-ID: https://ampcode.com/threads/T-019e805f-48b7-7594-9d46-415a44d2e1c5
Co-authored-by: Amp <amp@ampcode.com>
alecthomas
approved these changes
Jun 1, 2026
Problem
LFS snapshot jobs have been failing with repeated "Bad credentials" responses
from the GitHub LFS batch API. Individual jobs have been observed running for
12–26 hours before something finally killed them.
Root cause
GitHub App installation tokens have a fixed 1 h server-side TTL (not
configurable). The TokenManager cache served tokens until they had only
RefreshBuffer = 5m of validity remaining. That meant an LFS fetch starting
near a cache boundary could be handed a token with as little as ~5 minutes of
life left. If the fetch ran longer than that, the token expired mid-flight.
git-lfs then retried its batch requests with the same expired token (it's
baked into credential.helper as a literal), turning every retry into a 401,
and there was no per-subprocess timeout to stop the retry storm.
Fix
Two coordinated changes that make lfsFetchTimeout < RefreshBuffer an
airtight invariant:
- internal/githubapp/config.go: bump RefreshBuffer from 5m to 30m. Every token handed out now has at least 30 m of validity remaining.
- internal/strategy/git/snapshot.go: wrap git lfs fetch in context.WithTimeout(ctx, 25*time.Minute). Bounds the retry-storm runaway and guarantees no subprocess can outlive the token baked into its environment.
The 25 m < 30 m invariant is the airtight piece: the longest possible LFS
subprocess is shorter than the minimum remaining validity of any token it
could be handed, so a "Bad credentials" from token expiry should now be
impossible by construction on the LFS path.
Why not a refresh-mid-subprocess approach?
That alternative (file-backed credential helper + background refresh
goroutine) is much more invasive — new shell-form helper, temp-file
lifecycle, refresh goroutine, symlink-safety, ~150 LOC + tests. It only
buys us coverage of subprocesses that legitimately run longer than 1 h,
which our metrics don't show.