fix(lfs): extend GitHub token refresh buffer + add LFS fetch timeout by jrobotham-square · Pull Request #322 · block/cachew

jrobotham-square · 2026-06-01T01:12:22Z

Problem

LFS snapshot jobs have been failing with repeated Bad credentials from the
GitHub LFS batch API. Individual jobs have been observed running for 12–26
hours before something finally killed them:

git lfs fetch: Fetching reference refs/heads/main
batch response: Bad credentials
batch response: Bad credentials
... x6
error: failed to fetch some objects from '.../info/lfs'
: exit status 2, elapsed=25h58m40s

Root cause

GitHub App installation tokens have a fixed 1 h server-side TTL (not
configurable). The TokenManager cache served tokens until they had only
RefreshBuffer = 5m of validity remaining. That meant an LFS fetch starting
near a cache boundary could be handed a token with as little as ~5 minutes of
life left. If the fetch ran longer than that, the token expired mid-flight.
git-lfs's batch endpoint then retried with the same expired token (it's
baked into credential.helper as a literal), turning every retry into a 401,
and there was no per-subprocess timeout to stop the retry storm.

Fix

Two coordinated changes that make lfsFetchTimeout < RefreshBuffer an
airtight invariant:

internal/githubapp/config.go — bump RefreshBuffer from 5m to
30m. Every token handed out now has at least 30 m of validity remaining.
internal/strategy/git/snapshot.go — wrap git lfs fetch in
context.WithTimeout(ctx, 25*time.Minute). Bounds the retry-storm
runaway and guarantees no subprocess can outlive the token baked into
its environment.

The 25 m < 30 m invariant is the airtight piece: the longest possible LFS
subprocess is shorter than the minimum remaining validity of any token it
could be handed, so a "Bad credentials" from token expiry should now be
impossible by construction on the LFS path.

Why not a refresh-mid-subprocess approach?

That alternative (file-backed credential helper + background refresh
goroutine) is much more invasive — new shell-form helper, temp-file
lifecycle, refresh goroutine, symlink-safety, ~150 LOC + tests. It only
buys us coverage of subprocesses that legitimately run longer than 1 h,
which our metrics don't show.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d42735e155

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

LFS snapshot jobs were observed failing with repeated 'Bad credentials' from the GitHub LFS batch API, with single jobs churning for 12-26 hours before giving up. Root cause: a GitHub App installation token has a fixed 1 h server-side TTL, and the TokenManager cache served tokens until they had only 5 m of validity remaining. An LFS fetch that started near a cache boundary and ran longer than 5 m would exhaust its token mid-flight; git-lfs's internal batch retries then re-used the same expired token, producing a retry storm that ran until something else killed the subprocess. Two coordinated changes: - internal/githubapp/config.go: RefreshBuffer 5m -> 30m. Every token handed out now has at least 30 m of validity remaining. - internal/gitclone/manager.go, internal/strategy/git/snapshot.go: new LFSFetchTimeout (default 25m) wrapping 'git lfs fetch'. Bounds the retry-storm runaway and keeps subprocess lifetime below RefreshBuffer so a baked-in token can't expire mid-fetch. The 25m < 30m invariant is the airtight piece: the longest possible LFS subprocess (25 m) is shorter than the minimum remaining validity of any token it could be handed (30 m). Co-authored-by: Amp <amp@ampcode.com> Amp-Thread-ID: https://ampcode.com/threads/T-019e805f-48b7-7594-9d46-415a44d2e1c5

git-lfs spawns transfer helpers that inherit our stdout/stderr pipes, so killing only the top-level git process can leave CombinedOutput blocked indefinitely. Same pattern used by manager.go for clone/fetch. Amp-Thread-ID: https://ampcode.com/threads/T-019e805f-48b7-7594-9d46-415a44d2e1c5 Co-authored-by: Amp <amp@ampcode.com>

jrobotham-square · 2026-06-01T01:18:03Z

Good catch — fixed in 1cb340a. Applied the same Setpgid + Cancel-with-syscall.Kill(-pid, SIGKILL) pattern the clone/fetch paths in manager.go already use, so the 25 m deadline reliably tears down git-lfs and its transfer helpers.

Amp-Thread-ID: https://ampcode.com/threads/T-019e805f-48b7-7594-9d46-415a44d2e1c5 Co-authored-by: Amp <amp@ampcode.com>

jrobotham-square requested a review from a team as a code owner June 1, 2026 01:12

jrobotham-square requested review from stuartwdouglas and removed request for a team June 1, 2026 01:12

chatgpt-codex-connector Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread internal/strategy/git/snapshot.go

jrobotham-square force-pushed the extend-token-buffer-add-lfs-timeout branch from d42735e to 9575ac8 Compare June 1, 2026 01:15

review: tighten lfs process-group comment

b99f263

Amp-Thread-ID: https://ampcode.com/threads/T-019e805f-48b7-7594-9d46-415a44d2e1c5 Co-authored-by: Amp <amp@ampcode.com>

alecthomas approved these changes Jun 1, 2026

View reviewed changes

jrobotham-square merged commit 82f252d into main Jun 1, 2026
8 checks passed

jrobotham-square deleted the extend-token-buffer-add-lfs-timeout branch June 1, 2026 01:32

jrobotham-square mentioned this pull request Jun 1, 2026

fix(gitclone): refresh GitHub App token mid-subprocess via file-based credential helper #321

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(lfs): extend GitHub token refresh buffer + add LFS fetch timeout#322

fix(lfs): extend GitHub token refresh buffer + add LFS fetch timeout#322
jrobotham-square merged 3 commits into
mainfrom
extend-token-buffer-add-lfs-timeout

jrobotham-square commented Jun 1, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

jrobotham-square commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jrobotham-square commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause

Fix

Why not a refresh-mid-subprocess approach?

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

jrobotham-square commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jrobotham-square commented Jun 1, 2026 •

edited

Loading